Do LLMs Trust the Code They Write?

Francisco Ribeiro, Claudio Spiess, Prem Devanbu, Sarah Nadi

TL;DR

This work investigates whether LLMs encode internal representations of code correctness beyond what their output probabilities reveal. By adapting Representation Engineering (RepE) and its Linear Artificial Tomography (LAT) component to source code, the authors extract a latent correctness signal from hidden states and show that it distinguishes correct from incorrect code more reliably than standard confidence metrics. On HumanEval and BigCodeBench, across multiple 7–8B models, LAT improves correctness ranking and brings the accuracy of the selected sample closer to the pass@10 ceiling, often outperforming the RankEF approach without requiring costly test execution or training. The results suggest practical utility for test-free code ranking in development workflows and motivate further study of internal representations for non-functional properties of code. Data and code are publicly released to support reproducibility and future research on trustworthy AI-assisted programming.

Abstract

Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well-correlated with correctness, and reflect only the final output of the generation process. Inspired by findings that LLMs internally encode concepts like truthfulness, this paper explores if LLMs similarly represent code correctness. Specifically, we identify a correctness representation inside LLMs by contrasting the hidden states between pairs of correct and incorrect code for the same programming tasks. By experimenting on four LLMs, we show that exploiting this extracted correctness representation outperforms standard log-likelihood ranking, as well as verbalized model confidence. Furthermore, we explore how this internal correctness signal can be used to select higher-quality code samples, without requiring test execution. Ultimately, this work demonstrates how leveraging internal representations can enhance code generation systems and make LLMs more reliable, thus improving confidence in automatically generated code.
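
The contrastive idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the hidden states are synthetic vectors, the helper names (`fit_direction`, `rank_candidates`, `make_pairs`) are hypothetical, and the direction is estimated as the dominant axis of the correct-minus-incorrect differences via power iteration, which is one simple stand-in for the PCA step commonly used in LAT. In practice the vectors would come from a chosen layer of the model and a linear algebra library would replace the hand-written loops.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def fit_direction(h_correct, h_incorrect, iters=50):
    """Estimate a 'correctness direction' from paired hidden states.

    Takes the dominant axis of the (correct - incorrect) differences using
    power iteration on D^T D, where D stacks the difference vectors.
    """
    diffs = [[c - i for c, i in zip(hc, hi)]
             for hc, hi in zip(h_correct, h_incorrect)]
    # Start from the mean difference (up to scale).
    v = normalize([sum(col) for col in zip(*diffs)])
    for _ in range(iters):
        proj = [dot(d, v) for d in diffs]
        v = normalize([sum(p * d[j] for p, d in zip(proj, diffs))
                       for j in range(len(v))])
    # Orient the axis so that correct code scores higher on average.
    if sum(dot(d, v) for d in diffs) < 0:
        v = [-x for x in v]
    return v

def rank_candidates(h_candidates, direction):
    """Return candidate indices sorted from highest to lowest score."""
    scores = [dot(h, direction) for h in h_candidates]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def make_pairs(rng, n_pairs, true_dir, noise=0.3):
    """Synthetic stand-in for hidden states (assumption, not real model data).

    Each pair shares a task-specific base vector; the correct sample is
    shifted along true_dir and the incorrect one against it.
    """
    ok, bad = [], []
    for _ in range(n_pairs):
        base = [rng.gauss(0, 1) for _ in true_dir]
        ok.append([b + t + rng.gauss(0, noise) for b, t in zip(base, true_dir)])
        bad.append([b - t + rng.gauss(0, noise) for b, t in zip(base, true_dir)])
    return ok, bad

rng = random.Random(0)
true_dir = normalize([rng.gauss(0, 1) for _ in range(16)])

train_ok, train_bad = make_pairs(rng, 200, true_dir)
direction = fit_direction(train_ok, train_bad)

# Held-out pairs: does the correct sample receive the higher score?
test_ok, test_bad = make_pairs(rng, 100, true_dir)
hits = [dot(a, direction) > dot(b, direction) for a, b in zip(test_ok, test_bad)]
pair_accuracy = sum(hits) / len(hits)
print(f"cosine with planted direction: {dot(direction, true_dir):.3f}")
print(f"held-out pair accuracy: {pair_accuracy:.2f}")
```

Scoring a candidate is then a single dot product with the fitted direction, so ranking several generated samples for a task requires no test execution, only access to their hidden states.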

Paper Structure

This paper contains 37 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of two candidate solutions for the task "Given a non-empty list of integers, return the sum of all of the odd elements that are in even positions."
  • Figure 2: LAT overview
  • Figure 3: Prompt template for correctness evaluation.
  • Figure 4: Accuracy of different ranking methods for HumanEval compared to the pass@1 baseline and pass@10 ceiling.
  • Figure 5: Accuracy of different ranking methods for BigCodeBench compared to the pass@1 baseline and pass@10 ceiling.
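
To make the comparison in Figures 4 and 5 concrete, the following toy sketch shows how a ranking method's accuracy sits between a pass@1 baseline and a pass@k ceiling. It assumes that the baseline is the expected accuracy of picking one sample at random, that the ceiling counts a task as solved if any of its samples passes, and that a ranking method is credited when its top-scored sample passes; the function name and the data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def evaluate_ranking(tasks):
    """tasks: list of (scores, passed) pairs, one per programming task.

    scores[i] is the ranking score of candidate i; passed[i] says whether
    candidate i passes the task's tests.
    """
    n = len(tasks)
    # Baseline: chance that one randomly chosen sample is correct.
    baseline = sum(sum(passed) / len(passed) for _, passed in tasks) / n
    # Ceiling: a task counts if at least one sample is correct.
    ceiling = sum(any(passed) for _, passed in tasks) / n
    # Ranked: a task counts if the top-scored sample is correct.
    ranked = sum(
        passed[max(range(len(scores)), key=scores.__getitem__)]
        for scores, passed in tasks
    ) / n
    return baseline, ranked, ceiling

# Toy data: three tasks with four candidates each (illustrative only).
toy_tasks = [
    ([0.9, 0.1, 0.2, 0.3], [True, False, False, False]),   # top pick passes
    ([0.8, 0.5, 0.4, 0.1], [False, True, True, False]),    # top pick fails
    ([0.3, 0.2, 0.6, 0.1], [False, False, False, False]),  # no sample passes
]
baseline, ranked, ceiling = evaluate_ranking(toy_tasks)
print(f"baseline={baseline:.3f} ranked={ranked:.3f} ceiling={ceiling:.3f}")
```

No ranking method can exceed the ceiling, since it can only select among the samples that were generated.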