LoCaL: Countering Surface Bias in Code Evaluation Metrics

Generative AI & LLMs
Published: arXiv:2509.15397v1
Authors

Simantika Bhattacharjee Dristi, Matthew B. Dwyer

Abstract

With the increasing popularity of large language models (LLMs) and LLM-based agents, reliable and effective code evaluation metrics (CEMs) have become crucial for progress across several software engineering tasks. While popular benchmarks often provide test cases to assess the correctness of generated code, crafting and executing test cases is expensive. Reference-based CEMs provide a cheaper alternative by scoring a candidate program based on its functional similarity to a reference. Although prior research has focused on reporting the weak correlation between these CEMs and functional correctness, the causes are only assumed, and plausible solutions remain unexplored. In this work, we critically evaluate four state-of-the-art reference-based CEMs, revealing their strong bias towards surface-level features rather than code functionality. Despite this surface bias, current evaluation datasets for these CEMs rarely include code pairs that are surface-similar yet functionally dissimilar, or functionally similar yet surface-dissimilar. To mitigate this gap, we propose LoCaL (Looks Can Lie), a CEM evaluation benchmark, with 3117 code pairs at both the method and program levels. Each pair is labeled with a functional similarity score and aims to target regions where CEMs are likely to perform poorly. The functional similarity scores are calculated through differential fuzzing, which eliminates the need for predefined test cases and, at the same time, improves the reliability of the scores by executing an order of magnitude more tests than prior work. We find that all four CEMs show significant performance degradation on LoCaL, compared to the baselines. Finally, based on our findings, we draw the implication that exposing CEMs to LoCaL-like data might facilitate the development of metrics that are robust to surface bias.

Paper Summary

Problem
Reference-based code evaluation metrics (CEMs) score machine-generated code by comparing it to a reference solution, but they suffer from "surface bias": they reward code that looks like the reference rather than code that behaves like it. As a result, functionally incorrect code can be rated highly, while functionally correct but differently written code can be penalized, which undermines the software engineering decisions that rely on these scores.
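To see surface bias concretely, consider a toy Python pair (our own hypothetical example, not one of the paper's code pairs). The two clamp variants below differ only in swapping min and max, so a text-level similarity measure rates them nearly identical, yet they disagree on most inputs:

# Two functions that look almost identical but behave very differently.
import difflib

REFERENCE = '''
def clamp(x, lo, hi):
    return max(lo, min(x, hi))
'''

CANDIDATE = '''
def clamp(x, lo, hi):
    return min(lo, max(x, hi))
'''

# Surface similarity: character-level match ratio in [0, 1].
surface = difflib.SequenceMatcher(None, REFERENCE, CANDIDATE).ratio()
print(f"surface similarity: {surface:.2f}")  # close to 1.0

# Functional comparison: execute both definitions and compare their outputs.
ref_ns, cand_ns = {}, {}
exec(REFERENCE, ref_ns)
exec(CANDIDATE, cand_ns)
for args in [(5, 0, 10), (-3, 0, 10), (15, 0, 10)]:
    r, c = ref_ns["clamp"](*args), cand_ns["clamp"](*args)
    print(args, "->", r, "vs", c, "agree" if r == c else "disagree")

A metric that scores only the text would call this candidate a near-perfect match; a metric that scores behavior would not.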
Key Innovation
The researchers propose LoCaL (Looks Can Lie), a CEM evaluation benchmark of 3,117 method- and program-level code pairs designed to expose surface bias. Instead of relying on predefined test cases, LoCaL uses differential fuzzing to execute each pair on an order of magnitude more tests than prior work, yielding reliable ground-truth functional similarity scores. The benchmark deliberately targets pairs with a large gap between surface similarity and functional similarity, precisely the regions where CEMs perform poorly.
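Below is a minimal Python sketch of the differential-fuzzing idea (our simplification, not the paper's implementation; LoCaL fuzzes real method- and program-level pairs at much larger scale). The score is simply the fraction of randomly generated inputs on which two programs exhibit the same observable behavior:

import random

def observe(fn, args):
    """Capture fn's return value, or the exception type it raises, as the output."""
    try:
        return ("ok", fn(*args))
    except Exception as e:
        return ("err", type(e).__name__)

def functional_similarity(f, g, gen_input, trials=10_000, seed=0):
    """Estimate, in [0, 1], how often f and g behave alike on random inputs."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        args = gen_input(rng)
        if observe(f, args) == observe(g, args):
            agree += 1
    return agree / trials

# Hypothetical pair: surface-similar expressions with different semantics.
ref = lambda a, b: abs(a - b)
cand = lambda a, b: abs(a) - abs(b)
gen = lambda rng: (rng.randint(-100, 100), rng.randint(-100, 100))
print(functional_similarity(ref, cand, gen))  # well below 1.0

Treating a crash as part of the observable output matters here: two programs that raise the same exception on the same input are behaving alike on that input, and no predefined test suite is needed.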
Practical Impact
LoCaL gives researchers a stress test for evaluating and improving CEMs: all four state-of-the-art metrics studied degrade significantly on it, and the authors suggest that exposing metrics to LoCaL-like data could lead to metrics that are robust to surface bias. More reliable metrics, in turn, mean more trustworthy assessment of generated code. Additionally, LoCaL's ground-truth functional similarity scores are reusable for downstream tasks such as code clone detection, code optimization, code refactoring, and automated bug repair.
Analogy / Intuitive Explanation
Imagine grading a child's attempt to write the sentence "hello world". A grader with surface bias gives high marks to "hellos world" because it looks almost right, even though it is wrong, and might mark down "hi, world" even though it conveys the same meaning. LoCaL is a test made up of exactly such deceptive cases: it reveals which graders judge what a sentence means (what code does) rather than how it looks, and thereby pushes CEMs toward rating functional correctness instead of appearance.
Paper Information
Categories: cs.SE
Published Date: September 2025
arXiv ID: 2509.15397v1
