This page describes the assessment criteria used to score each repair tool on each repair task. The competition evaluates tools on three aspects: efficacy, response time, and resource usage.
Each tool is executed on every track it has registered for. For each repair task, the tool may generate at most 5 patches in unified diff format that can be applied to the original program. Each patch is applied to the original program and executed against the following set of test suites:
| Label | Build Success | Public Tests | Private Tests | Adversarial Tests |
|---|---|---|---|---|
| Invalid | failed | - | - | - |
| Incorrect | success | failed | - | - |
| Over-fitting | success | success | failed | - |
| Correct | success | success | success | failed |
| High Quality | success | success | success | success |
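The classification above behaves like a short-circuiting pipeline: after a patch is applied, it is built and run against successively stricter test suites, and the first failing stage determines its label. The minimal sketch below illustrates this logic; the `run_tests.sh` harness script is a hypothetical placeholder, not part of the official infrastructure.

```python
import subprocess

# Ordered test stages from the table above; the script name is an assumption.
STAGES = ["build", "public", "private", "adversarial"]
LABELS = ["Invalid", "Incorrect", "Over-fitting", "Correct", "High Quality"]

def run_stage(stage):
    # Hypothetical harness invocation; the real competition infrastructure
    # may expose builds and test suites differently.
    return subprocess.run(["./run_tests.sh", stage]).returncode == 0

def classify_patch():
    """Return the label of an applied patch: the first failing stage decides."""
    for i, stage in enumerate(STAGES):
        if not run_stage(stage):
            return LABELS[i]
    return LABELS[-1]  # passed every suite, including the adversarial tests
```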
For each repair task, we capture the minimum time taken to produce a result for the user. A result can take one of two forms: a plausible patch, or an output stating that no patch could be found.
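As an illustration of what is measured, the sketch below records the wall-clock time until the tool returns its first result, whether that is a patch or a no-patch output. The callable entry point and task object are assumptions made for this example only.

```python
import time

def timed_invocation(repair_tool, task):
    """Measure wall-clock time until the tool returns its first result,
    whether that is a patch or a statement that no patch was found.
    `repair_tool` and `task` are illustrative placeholders."""
    start = time.monotonic()
    result = repair_tool(task)
    return result, time.monotonic() - start
```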
For each repair task, the memory, CPU, and GPU usage required to produce a result are captured and used to evaluate the efficiency of the tool.
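A minimal sketch of such resource sampling is shown below, using the third-party `psutil` library purely for illustration; the actual measurement tooling used by the competition harness is not specified here.

```python
import psutil  # third-party library, used here purely for illustration

def sample_resources(pid):
    """Sample memory and CPU usage of a running tool process.
    GPU usage would be collected separately (for example via nvidia-smi);
    the competition harness may use different tooling."""
    proc = psutil.Process(pid)
    return {
        "rss_bytes": proc.memory_info().rss,          # resident memory
        "cpu_percent": proc.cpu_percent(interval=1),  # CPU over a 1-second window
    }
```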
For each task, we aim to offer the highest reward to tools that prioritize generating only correct and high-quality solutions. To this end, invalid, incorrect, and over-fitting patches are penalized.
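One way to read this is as a per-patch scoring table in which correct and high-quality patches earn positive scores while the other labels carry penalties. The numbers in the sketch below are illustrative placeholders only, not the official weights.

```python
# Illustrative per-patch scores only; the official reward and penalty
# values are defined by the competition, not by this sketch.
PATCH_SCORES = {
    "High Quality": 2.0,
    "Correct": 1.0,
    "Over-fitting": -0.5,
    "Incorrect": -0.5,
    "Invalid": -1.0,
}

def task_score(patch_labels):
    """Aggregate the labels of the (at most 5) patches submitted for one task."""
    return sum(PATCH_SCORES[label] for label in patch_labels)
```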
For each track, the candidate tools are ranked according to their total score for that track. Candidates with the same score are further differentiated by Response Time. If a tie remains, candidates are differentiated by their Resource Usage Footprint.
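Ranking with these tie-breaks amounts to a lexicographic sort, sketched below with illustrative field names.

```python
def rank_candidates(candidates):
    """Rank tools by total score (higher is better), breaking ties by
    response time and then by resource usage footprint (lower is better).
    The dictionary keys are illustrative field names."""
    return sorted(candidates,
                  key=lambda c: (-c["score"], c["time"], c["resources"]))
```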