Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen

Abstract

As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, revealing alignment gaps between LLM judges and human preferences in realistic coding applications.

Paper Structure

This paper contains 56 sections, 7 figures, and 13 tables.

Figures (7)

  • Figure 1: Example of developer–LLM misalignment on a code editing task. The developer provides a prompt and receives two LLM code solutions; the developer prefers the top response while the LLM judge selects the bottom one. Comparing the responses against the extracted rubric items shows that the developer prefers less robustness and more comments, while the LLM judge prefers more robustness and fewer comments.
  • Figure 2: Overview of TRACE. Given a set of pairwise options, TRACE follows a three-step workflow: (1) we collect LLM judgments between responses to measure alignment with human preferences; (2) we automatically generate rubric criteria capturing differences between responses (e.g., error handling), then aggregate these criteria to form a comprehensive evaluation rubric; (3) we construct feature vectors from rubric scores on each sample and train a logistic regression model to predict LLM judgments. We use the learned coefficients to identify which rubric dimensions drive misalignment between LLMs and humans (see the sketch after this list).
  • Figure 3: Judge misalignment reveals distinct rubric biases across interaction modalities. Each cell shows the signed difference between judge and human preference coefficients ($\beta_J^{(i)} - \beta_H^{(i)}$) for selected rubric items within each interaction modality. Positive values (red) indicate that judges overweight a rubric item relative to humans, while negative values (blue) indicate underweighting. Rows show the highest-divergence rubric dimensions within each modality. Bolded values indicate significant judge-human gaps, defined as cases where the 95% confidence interval for $\beta_J^{(i)}$ excludes $\beta_H^{(i)}$.
  • Figure 4: Reviewer portal used for the human baseline study. The interface shows the task context and two candidate answers (Option 1/Option 2); the mapping from underlying candidates to displayed options is randomized to mitigate positional effects.
  • Figure 5: Code completion alignment across all judges and rubrics.
  • ...and 2 more figures
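
The logistic-regression step described in the Figure 2 caption, together with the coefficient-gap test from the Figure 3 caption, can be made concrete with a short sketch. The Python below is a minimal, hypothetical rendering, not the paper's implementation: the function names (fit_preference_coefs, judge_human_gaps), the per-pair rubric-score-difference features, the bootstrap confidence interval, and the toy data are all assumptions introduced here for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fit_preference_coefs(X, y):
    # Fit a logistic regression predicting which response in a pair is
    # preferred, from per-pair differences in rubric scores (assumed
    # feature construction; TRACE's actual features may differ).
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model.coef_.ravel()

def judge_human_gaps(X, y_judge, y_human, n_boot=1000, alpha=0.05):
    # Signed gaps beta_J - beta_H per rubric item, as plotted in
    # Figure 3. A gap is flagged significant when a bootstrap 95% CI
    # for beta_J excludes beta_H (one plausible reading of the
    # criterion stated in the Figure 3 caption).
    beta_j = fit_preference_coefs(X, y_judge)
    beta_h = fit_preference_coefs(X, y_human)
    boots = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # resample pairs
        boots[b] = fit_preference_coefs(X[idx], y_judge[idx])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return beta_j - beta_h, (beta_h < lo) | (beta_h > hi)

# Toy data: 200 pairs scored on 3 rubric items (say robustness, comments,
# explanation length); label 1 means response A was preferred.
X = rng.normal(size=(200, 3))
y_judge = (X @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=200) > 0).astype(int)
y_human = (X @ np.array([0.2, 0.7, -0.4]) + rng.normal(size=200) > 0).astype(int)
gaps, significant = judge_human_gaps(X, y_judge, y_human)
print(gaps.round(2), significant)

On this toy data, a large positive entry in gaps paired with a True flag corresponds to a red, bolded cell in Figure 3: a rubric item the simulated judge overweights relative to the simulated human.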