Evaluation


This page describes the criteria used to score each repair tool on each repair task. The competition evaluates three aspects: efficacy, response time, and resource usage.

Efficacy Assessment

Each tool is executed on every track it has registered for. For each repair task, the tool may generate at most 5 patches in unified diff format, each of which must be applicable to the original program. Each patch is applied to the original program, and the patched program is executed against the following test suites:

  • Public Test Suite
    Test cases that are provided to the tool together with the repair task.
  • Private Test Suite
    Additional test cases that we generate using the reference program (the version deemed correct). They are not provided to the repair tool during the repair process.
  • Adversarial Test Suite
    During the rebuttal phase (ref submission), competing teams can provide additional test cases to invalidate other teams' patches. We curate a set of test cases from each participating team.

Label          Build Success   Public Tests   Private Tests   Adversarial Tests
Invalid        failed          -              -               -
Incorrect      success         failed         -               -
Over-fitting   success         success        failed          -
Correct        success         success        success         failed
High Quality   success         success        success         success

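
The labels in the table above follow a "first failing stage" rule. A minimal sketch of that rule is shown here; the function name `label_patch` and its boolean parameters are hypothetical and are not part of the competition infrastructure.

```python
# Labels ordered by the stage whose failure produces them:
# build -> public tests -> private tests -> adversarial tests.
STAGE_FAILURE_LABELS = ["Invalid", "Incorrect", "Over-fitting", "Correct"]


def label_patch(build_ok: bool, public_ok: bool,
                private_ok: bool, adversarial_ok: bool) -> str:
    """Return the label of a patch given the outcome of each stage.

    Stages are checked in order; later stages are ignored once an
    earlier one has failed (the "-" cells in the table).
    """
    outcomes = [build_ok, public_ok, private_ok, adversarial_ok]
    for passed, label in zip(outcomes, STAGE_FAILURE_LABELS):
        if not passed:
            return label
    # Every stage succeeded.
    return "High Quality"
```

For example, a patch that builds and passes the public tests but fails the private tests would be labeled "Over-fitting" under this sketch.
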
Response Time Assessment

For each repair task, we capture the minimum time taken to deliver a result to the user. A result takes one of two forms: a plausible patch, or an output stating that no patch could be found.
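
As an illustration only, wall-clock time to a result could be measured as sketched here. The helper `timed_run` is hypothetical, and the convention that a tool returns a patch string or `None` (no patch found) is an assumption made for this example; the page does not specify how the competition measures time.

```python
import time
from typing import Callable, Optional, Tuple


def timed_run(tool: Callable[[], Optional[str]]) -> Tuple[Optional[str], float]:
    """Run a repair tool once and return (result, elapsed_seconds).

    Assumption: `tool` returns a patch as a string, or None when it
    reports that it cannot find a patch. Both count as a result.
    """
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = tool()
    elapsed = time.perf_counter() - start
    return result, elapsed
```
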

Resource Usage Assessment

For each repair task, the memory, CPU, and GPU usage incurred to produce a result is captured and used to evaluate the efficiency of the tool.
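
One possible way to capture CPU time and peak memory of a tool run on a Unix system is sketched here, using Python's standard `resource` module. This is an assumption about how such measurements could be taken, not the competition's actual harness; the helper `run_and_measure` is hypothetical, and GPU usage is not covered.

```python
import resource  # Unix only
import subprocess
import sys


def run_and_measure(cmd):
    """Run `cmd` as a child process and report its resource usage.

    CPU time is the difference in accumulated child CPU time before and
    after the run. Peak RSS is the maximum over all children waited for
    so far, reported in kilobytes on Linux and in bytes on macOS.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(cmd, capture_output=True, text=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_seconds = ((after.ru_utime - before.ru_utime)
                   + (after.ru_stime - before.ru_stime))
    return {
        "returncode": completed.returncode,
        "cpu_seconds": cpu_seconds,
        "peak_rss_children": after.ru_maxrss,
    }
```

A call such as `run_and_measure([sys.executable, "-c", "pass"])` measures a trivial Python process.
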

Scoring Criteria

For each task, we aim to give the highest reward to tools that prioritize generating only correct and high-quality solutions. To this end, we penalize invalid, incorrect, and over-fitting patches.
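
To make the idea of rewards and penalties concrete, here is a sketch of a per-task score. The weights in `EXAMPLE_WEIGHTS` and the choice to sum over patches are placeholders invented for illustration; this page does not state the actual values or formula.

```python
# HYPOTHETICAL weights, chosen only so that correct and high-quality
# patches are rewarded and the other labels are penalized.
EXAMPLE_WEIGHTS = {
    "Invalid": -3,
    "Incorrect": -2,
    "Over-fitting": -1,
    "Correct": 1,
    "High Quality": 2,
}

MAX_PATCHES_PER_TASK = 5  # stated limit: at most 5 patches per task


def task_score(patch_labels, weights=EXAMPLE_WEIGHTS):
    """Sum the weight of every submitted patch for one repair task."""
    if len(patch_labels) > MAX_PATCHES_PER_TASK:
        raise ValueError("at most 5 patches may be submitted per task")
    return sum(weights[label] for label in patch_labels)
```

Under these placeholder weights, submitting one high-quality patch alongside an invalid one scores lower than submitting the high-quality patch alone, which reflects the stated aim of rewarding tools that emit only good patches.
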

Ranking Criteria

For each track, the candidate tools are ranked by their total score for that track. Candidates with the same score are further differentiated by Response Time. If there is still a tie, they can be further differentiated by their Resource Usage footprint.
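
The tie-breaking order described above can be expressed as a single sort key. The sketch below assumes each tool has already been reduced to one total score, one response time, and one resource-usage figure per track; the `ToolResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolResult:
    name: str
    total_score: float     # higher is better
    response_time: float   # lower is better
    resource_usage: float  # lower is better


def rank_tools(results: List[ToolResult]) -> List[ToolResult]:
    """Order tools by score (descending), then response time, then
    resource usage (both ascending)."""
    return sorted(
        results,
        key=lambda r: (-r.total_score, r.response_time, r.resource_usage),
    )
```

Negating the score lets one ascending sort handle all three criteria, so the later fields only matter when the earlier ones are equal.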