Calibration and Correctness of Language Models for Code

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed

TL;DR

This work investigates how well confidence signals from code-generating LLMs align with actual code correctness across function synthesis, line-level completion, and program repair. It introduces an evaluation framework with intrinsic and reflective confidence measures, and demonstrates that out-of-the-box calibration is generally poor. The authors show that Platt scaling improves calibration in many cases, though not universally, and highlight risks like bucket collapse. A notable finding is that retrieval-augmented few-shot prompting can substantially boost calibrated confidence for line-level code completion, suggesting practical paths toward risk-aware deployment of coding assistants.

Abstract

Machine learning models are widely used, but are also often wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that a rational decision can be made about whether to use the output. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with the likelihood of correctness, then the model is said to be well-calibrated. A well-calibrated confidence measure can serve as a basis for rational, graduated decision-making on how much review and care is needed when using generated code. Calibration has so far been studied mostly in non-generative (e.g. classification) settings, especially in software engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to use it directly, use it after review of varying intensity, or discard it. Thus, calibration is vital in generative settings. We make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, the generative code models we test are not well-calibrated out of the box. We then show how calibration can be improved using standard methods, such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in software engineering, and discuss settings where it has good potential for practical use and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in software engineering.
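The abstract names Platt scaling as a standard way to improve calibration when correctness data is already available. The following is a minimal sketch of that idea, not the authors' implementation: it fits a one-dimensional logistic regression that maps a raw confidence score to a rescaled probability. The function names (`fit_platt`, `apply_platt`), the toy data, and the choice of plain gradient descent with these hyperparameters are all assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss.

    scores: raw confidence values from the model (hypothetical input).
    labels: 1 if the generated code was judged correct, else 0.
    lr and steps are rough illustrative choices, not tuned values.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score, a, b):
    """Rescale one raw confidence score with fitted parameters."""
    return sigmoid(a * score + b)


def brier(conf, labels):
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((p - y) ** 2 for p, y in zip(conf, labels)) / len(conf)


# Toy, made-up data: the raw scores are high, but only 3 of 8 outputs are correct.
raw = [0.90, 0.95, 0.85, 0.80, 0.99, 0.70, 0.92, 0.88]
correct = [1, 0, 0, 0, 1, 0, 0, 1]

a, b = fit_platt(raw, correct)
scaled = [apply_platt(s, a, b) for s in raw]
print("Brier before:", round(brier(raw, correct), 3))
print("Brier after: ", round(brier(scaled, correct), 3))
```

In practice the parameters would be fitted on held-out data and applied to new outputs; fitting and evaluating on the same eight points, as above, only illustrates the mechanics.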

Paper Structure (40 sections, 6 equations, 11 figures, 8 tables)


Figures (11)

  • Figure 1: Sample calibration plots contrasting well-calibrated and poorly calibrated confidence.
  • Figure 2: Reliability plots for DyPyBench line-level code completion tasks, with respect to the All Pass @1 correctness measure and the Average Token Probability confidence measure. GPT-3.5 was used for both experiments. The bottom histogram shows the number of samples in each bin. $\mathcal{B}_{ref}$ refers to the Brier score of the unskilled predictor, $ECE$ to Expected Calibration Error, $\mathcal{B}$ to Brier Score, and $SS$ to Skill Score. The red and purple lines represent scaled and non-scaled quantile bins, with 1/5 of the data at each point, rather than evenly spaced bins. The non-scaled plot (left) shows over-confidence: the confidence estimate is high, but the actual correctness is low. The scaled plot (right) improves calibration.
  • Figure 3: Few-shot reflective reliability plot, based on the "FS BM25" row of \ref{table:fewshot}.
  • Figure A1: Prompts for Verbalized Self-Ask and Question Answering logit.
  • Figure A2: Prompt and model output for the tasks while calculating confidence measure based on Average Token Probability and Generated Sequence Probability.
  • ...and 6 more figures
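
The Figure 2 caption refers to four quantities: Brier Score $\mathcal{B}$, the unskilled-predictor Brier score $\mathcal{B}_{ref}$, Skill Score $SS$, and Expected Calibration Error $ECE$. As a rough illustration, the sketch below computes them under the standard textbook definitions, which are assumed here rather than taken from the paper: $\mathcal{B}$ is the mean squared error between confidence and 0/1 correctness, $\mathcal{B}_{ref}$ is the Brier score of always predicting the base rate, $SS = 1 - \mathcal{B}/\mathcal{B}_{ref}$, and $ECE$ is the sample-weighted gap between mean confidence and accuracy per bin. The function names and the use of evenly spaced bins are illustrative choices.

```python
def brier_score(conf, correct):
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)


def brier_ref(correct):
    """Brier score of an unskilled predictor that always outputs the base rate."""
    rate = sum(correct) / len(correct)
    return rate * (1.0 - rate)


def skill_score(conf, correct):
    """1 - B / B_ref; positive means better than the unskilled predictor.

    Undefined (division by zero) when every sample is correct or every
    sample is wrong, because B_ref is then 0.
    """
    return 1.0 - brier_score(conf, correct) / brier_ref(correct)


def expected_calibration_error(conf, correct, n_bins=10):
    """ECE over evenly spaced confidence bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up over-confident example: confidence 0.9 everywhere, 1 of 4 correct.
conf = [0.9, 0.9, 0.9, 0.9]
correct = [1, 0, 0, 0]
print(brier_score(conf, correct))
print(brier_ref(correct))
print(skill_score(conf, correct))
print(expected_calibration_error(conf, correct))
```

A negative skill score, as in this toy example, indicates that the confidence measure is less informative than simply reporting the overall rate of correct outputs.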