This page describes the assessment criteria used to score each repair tool on each repair task. The competition evaluates tools on three aspects: efficacy, response time, and resource usage.
Each tool is executed on every track it has registered for. For each repair task, the tool may generate at most 5 patches in unified diff format that can be applied to the original program. Each patch is applied to the original program and executed against the following set of test suites:
| Label | Build Success | Public Tests | Private Tests | Adversarial Tests |
|---|---|---|---|---|
| Invalid | failed | - | - | - |
| Incorrect | success | failed | - | - |
| Over-fitting | success | success | failed | - |
| Correct | success | success | success | failed |
| High Quality | success | success | success | success |
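The classification above behaves like a short-circuiting pipeline: after a patch is applied, it is built and run against successively stricter test suites, and the first failing stage determines its label. The minimal sketch below illustrates this logic; the `run_tests.sh` harness script is a hypothetical placeholder, not part of the official infrastructure.

```python
import subprocess

# Ordered test stages from the table above; the script name is an assumption.
STAGES = ["build", "public", "private", "adversarial"]
LABELS = ["Invalid", "Incorrect", "Over-fitting", "Correct", "High Quality"]

def run_stage(stage):
    # Hypothetical harness invocation; the real competition infrastructure
    # may expose builds and test suites differently.
    return subprocess.run(["./run_tests.sh", stage]).returncode == 0

def classify_patch():
    """Return the label of an applied patch: the first failing stage decides."""
    for i, stage in enumerate(STAGES):
        if not run_stage(stage):
            return LABELS[i]
    return LABELS[-1]  # passed every suite, including the adversarial tests
```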
For each repair task, we capture the minimum time taken to produce a result for the user. A result can take one of two forms: a plausible patch, or an output stating that no patch could be found.
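As an illustration of what is measured, the sketch below records the wall-clock time until the tool returns its first result, whether that is a patch or a no-patch output. The callable entry point and task object are assumptions made for this example only.

```python
import time

def timed_invocation(repair_tool, task):
    """Measure wall-clock time until the tool returns its first result,
    whether that is a patch or a statement that no patch was found.
    `repair_tool` and `task` are illustrative placeholders."""
    start = time.monotonic()
    result = repair_tool(task)
    return result, time.monotonic() - start
```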
For each repair task, the memory, CPU, and GPU usage required to produce a result are captured and used to evaluate the efficiency of the tool.
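A minimal sketch of such resource sampling is shown below, using the third-party `psutil` library purely for illustration; the actual measurement tooling used by the competition harness is not specified here.

```python
import psutil  # third-party library, used here purely for illustration

def sample_resources(pid):
    """Sample memory and CPU usage of a running tool process.
    GPU usage would be collected separately (for example via nvidia-smi);
    the competition harness may use different tooling."""
    proc = psutil.Process(pid)
    return {
        "rss_bytes": proc.memory_info().rss,          # resident memory
        "cpu_percent": proc.cpu_percent(interval=1),  # CPU over a 1-second window
    }
```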
For each task, we aim to offer the highest reward to tools that prioritize generating only correct and high-quality solutions. To this end, invalid, incorrect, and over-fitting patches are penalized.
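One way to read this is as a per-patch scoring table in which correct and high-quality patches earn positive scores while the other labels carry penalties. The numbers in the sketch below are illustrative placeholders only, not the official weights.

```python
# Illustrative per-patch scores only; the official reward and penalty
# values are defined by the competition, not by this sketch.
PATCH_SCORES = {
    "High Quality": 2.0,
    "Correct": 1.0,
    "Over-fitting": -0.5,
    "Incorrect": -0.5,
    "Invalid": -1.0,
}

def task_score(patch_labels):
    """Aggregate the labels of the (at most 5) patches submitted for one task."""
    return sum(PATCH_SCORES[label] for label in patch_labels)
```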
For each track, the candidate tools are ranked according to their total score for that track. Candidates with the same score are further differentiated by Response Time. If a tie remains, candidates are differentiated by their Resource Usage Footprint.
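Ranking with these tie-breaks amounts to a lexicographic sort, sketched below with illustrative field names.

```python
def rank_candidates(candidates):
    """Rank tools by total score (higher is better), breaking ties by
    response time and then by resource usage footprint (lower is better).
    The dictionary keys are illustrative field names."""
    return sorted(candidates,
                  key=lambda c: (-c["score"], c["time"], c["resources"]))
```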