Do LLMs Trust the Code They Write?

Francisco Ribeiro; Claudio Spiess; Prem Devanbu; Sarah Nadi

TL;DR

This work investigates whether LLMs encode an internal representation of code correctness that goes beyond their output probabilities. By adapting Representation Engineering (RepE) and its Linear Artificial Tomography (LAT) component to source code, the authors extract a latent correctness signal from hidden states and demonstrate that this signal distinguishes correct from incorrect code more reliably than standard confidence metrics. They show that LAT improves correctness ranking and moves pass rates closer to the pass@10 ceiling on HumanEval and BigCodeBench across multiple 7–8B-parameter models, often outperforming the RankEF approach without requiring costly test execution or training. The results suggest practical utility for test-free code ranking in development workflows and motivate further exploration of internal representations for broader, non-functional properties. Data and code are publicly released to support reproducibility and future research in trustworthy AI-assisted programming.
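To make the "pass@10 ceiling" comparison concrete, the sketch below shows one way such a gap could be measured: for each task, a ranker picks its top-scored sample out of k candidates, and the resulting pass rate is compared with the fraction of tasks for which any candidate passes. The function name `selection_metrics` and the input layout are assumptions made for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def selection_metrics(scores, passed):
    """Compare ranker-selected accuracy with the any-of-k ceiling.

    scores[i][j]: ranker score for sample j of task i (higher = more trusted).
    passed[i][j]: True if sample j of task i passes its tests.
    Returns (top1_rate, ceiling_rate) over all tasks.
    """
    n_tasks = len(scores)
    top1_hits = 0
    ceiling_hits = 0
    for task_scores, task_passed in zip(scores, passed):
        # Index of the sample the ranker trusts most for this task.
        best = max(range(len(task_scores)), key=task_scores.__getitem__)
        top1_hits += bool(task_passed[best])
        # The ceiling counts a task as solved if any sample passes.
        ceiling_hits += any(task_passed)
    return top1_hits / n_tasks, ceiling_hits / n_tasks
```

A ranker whose top-1 rate equals the ceiling rate always picks a passing sample whenever one exists, so the gap between the two numbers is a measure of how much correct code the ranker leaves unselected.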

Abstract

Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well correlated with correctness and reflect only the final output of the generation process. Inspired by findings that LLMs internally encode concepts like truthfulness, this paper explores whether LLMs similarly represent code correctness. Specifically, we identify a correctness representation inside LLMs by contrasting the hidden states of paired correct and incorrect code for the same programming tasks. By experimenting on four LLMs, we show that exploiting this extracted correctness representation outperforms standard log-likelihood ranking, as well as verbalized model confidence. Furthermore, we explore how this internal correctness signal can be used to select higher-quality code samples without requiring test execution. Ultimately, this work demonstrates how leveraging internal representations can enhance code generation systems and make LLMs more reliable, thus improving confidence in automatically generated code.
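The contrastive idea in the abstract can be illustrated with a minimal sketch, assuming hidden states have already been extracted as fixed-size vectors (one per code sample). It takes the dominant direction of the correct-minus-incorrect differences and scores new candidates by projecting onto it. The function names, the use of a plain SVD, and the synthetic data are assumptions for illustration, not the authors' implementation; in particular, this sketch skips mean-centering so that the consistent shift between correct and incorrect samples is retained.

```python
import numpy as np

def correctness_direction(h_correct, h_incorrect):
    """Estimate a unit 'correctness' direction from paired hidden states.

    h_correct, h_incorrect: arrays of shape (n_pairs, hidden_dim); row i of
    each holds the hidden state for a correct / incorrect solution to task i.
    """
    diffs = h_correct - h_incorrect
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # The sign of a singular vector is arbitrary; orient it so that
    # correct samples project higher than incorrect ones on average.
    if (diffs @ direction).mean() < 0:
        direction = -direction
    return direction

def rank_candidates(h_candidates, direction):
    """Return candidate indices sorted from highest to lowest score."""
    scores = h_candidates @ direction
    return np.argsort(-scores), scores

if __name__ == "__main__":
    # Synthetic stand-in for real hidden states (an assumption of this demo).
    rng = np.random.default_rng(0)
    dim = 16
    planted = np.zeros(dim)
    planted[0] = 1.0
    base = rng.normal(size=(100, dim))
    h_pos = base + 2.0 * planted + 0.1 * rng.normal(size=(100, dim))
    h_neg = base - 2.0 * planted + 0.1 * rng.normal(size=(100, dim))
    w = correctness_direction(h_pos, h_neg)
    order, scores = rank_candidates(rng.normal(size=(10, dim)), w)
    print("ranked candidate indices:", order)
```

Because scoring is a single dot product per candidate, ranking in this style needs no test execution once the direction has been estimated.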
