Showing LLM-Generated Code Selectively Based on Confidence of LLMs

Jia Li; Yuqi Zhu; Yongmin Li; Ge Li; Zhi Jin

TL;DR

The paper tackles the risk of showing developers erroneous LLM-generated code by introducing HonestCoder, an approach that estimates an LLM's confidence from the multi-modal similarity between multiple sampled programs and shows only high-confidence results. It introduces TruthCodeBench, a benchmark covering Python and Java, and evaluates HonestCoder on it with four popular LLMs, reporting substantial improvements in separating correct from incorrect outputs (AUROC and AUCPR) and fewer erroneous programs shown. HonestCoder adds only a small time overhead (approximately 0.4 seconds per requirement) and is robust across languages and model sizes, which highlights the value of a multi-modal, execution-free confidence estimator. The work lays groundwork for reliable, human-in-the-loop code generation and invites further exploration of confidence-aware, retrieval-augmented, and cross-domain code generation techniques.

Abstract

Large Language Models (LLMs) have shown impressive abilities in code generation, but they may generate erroneous programs. Reading a program takes ten times longer than writing it. Showing these erroneous programs to developers wastes their effort and introduces security risks to software. To address these limitations, we propose HonestCoder, a novel LLM-based code generation approach. HonestCoder selectively shows the generated programs to developers based on LLMs' confidence, which provides valuable insight into the correctness of the generated programs. To achieve this goal, we propose a novel approach to estimating LLMs' confidence in code generation: it measures the multi-modal similarity between LLM-generated programs. We collect and release a multilingual benchmark named TruthCodeBench, which consists of 2,265 samples and covers two popular programming languages (i.e., Python and Java). We apply HonestCoder to four popular LLMs (e.g., DeepSeek-Coder and Code Llama) and evaluate it on TruthCodeBench. Based on the experiments, we obtain the following insights. (1) HonestCoder can effectively estimate LLMs' confidence and accurately determine the correctness of generated programs. For example, HonestCoder outperforms the state-of-the-art baseline by 27.79% in AUROC and 63.74% in AUCPR. (2) HonestCoder can decrease the number of erroneous programs shown to developers. Compared to eight baselines, it shows more correct programs and fewer erroneous programs to developers. (3) Compared to showing code indiscriminately, HonestCoder adds only a slight time overhead (approximately 0.4 seconds per requirement). (4) We discuss future directions to facilitate the application of LLMs in software development. We hope this work can motivate broad discussions about measuring the reliability of LLMs' outputs in performing code-related tasks.
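
The selective-display idea in the abstract can be illustrated with a minimal sketch. It assumes a stand-in similarity (token-level matching from Python's `difflib`) rather than the paper's multi-modal measure, which this page does not specify. The function names, the mean-pairwise aggregation, and the 0.8 threshold are hypothetical choices for illustration only; the refusal text is the one quoted in the Figure 1 caption.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Refusal message quoted in the Figure 1 caption.
REFUSAL = "I can not solve this requirement"


def pairwise_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the paper's multi-modal measure."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def estimate_confidence(programs: list[str]) -> float:
    """Mean pairwise similarity of sampled programs (hypothetical aggregation)."""
    pairs = list(combinations(programs, 2))
    if not pairs:
        return 0.0  # fewer than two samples give no agreement signal
    return sum(pairwise_similarity(a, b) for a, b in pairs) / len(pairs)


def respond(programs: list[str], threshold: float = 0.8) -> str:
    """Show the first sampled program only if confidence reaches the threshold."""
    if programs and estimate_confidence(programs) >= threshold:
        return programs[0]
    return REFUSAL


if __name__ == "__main__":
    # Toy samples, invented for illustration.
    consistent = ["def add(a, b):\n    return a + b"] * 3
    divergent = [
        "def f(x): return x + 1",
        "import os\nprint(os.getcwd())",
        "class A: pass",
    ]
    print(respond(consistent))  # the samples agree, so the program is shown
    print(respond(divergent))   # the samples disagree, so the refusal is shown
```

The design point is that no program is executed: confidence comes only from how much the sampled programs agree with one another, which matches the intuition in Figure 2(b) that programs for passed requirements are more similar to each other.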

Paper Structure (24 sections, 10 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Motivating Examples
HonestCoder
An Overview of HonestCoder
Code Generation
Confidence Estimation
Response Output
Evaluation Benchmark - TruthCodeBench
An Overview of TruthCodeBench
Benchmark Collection Pipeline
Study Design
Research Questions
Studied LLMs
Compared Baselines
Evaluation Metrics
...and 9 more sections

Figures (6)

Figure 1: The comparison of (a) previous LLM-based code generation approaches and (b) our HonestCoder. Previous approaches indiscriminately show developers the generated code, including the erroneous code. In contrast, HonestCoder selectively shows the generated code based on LLMs' confidence. When LLMs are uncertain, HonestCoder outputs "I can not solve this requirement".
Figure 2: (a) The performance of a popular LLM, DeepSeek Coder-6.7B, on two popular benchmarks [Codex, MBPP]. It may generate programs with errors (e.g., functional errors). (b) The similarity distribution between programs in passed/failed requirements. The programs in passed requirements are more deterministic (i.e., higher similarity) than the ones in failed requirements.
Figure 3: An overview of HonestCoder. Given a requirement, it samples multiple programs from LLMs (see the Code Generation section). Then, it leverages an estimator to estimate the LLMs' confidence by measuring the multi-modal similarities between sampled programs (see the Confidence Estimation section). Finally, it determines whether to show the generated programs based on confidence (see the Response Output section).
Figure 4: Two samples of TruthCodeBench in Python and Java, respectively. Each sample has two components: a requirement and LLM-specific binary labels (i.e., passed or failed). The label represents whether a specific LLM can generate the correct programs for a requirement. We expect that HonestCoder only shows the generated code for passed requirements and refuses failed requirements.
Figure 5: The number of correct and erroneous programs shown to developers on TruthCodeBench. The stars denote the results of showing LLM-generated programs indiscriminately. The curves denote the results of baselines and HonestCoder under different thresholds. The closer the curve is to the upper left corner, the better the approach's performance.
...and 1 more figure
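
The trade-off described in the Figure 5 caption can be sketched as a threshold sweep: for each threshold, count how many correct and how many erroneous programs would be shown. The helper names and the toy confidences and labels below are invented for illustration and are not results from the paper.

```python
def shown_counts(confidences, labels, threshold):
    """Return (correct_shown, erroneous_shown) for items with confidence >= threshold."""
    correct = sum(1 for c, ok in zip(confidences, labels) if c >= threshold and ok)
    erroneous = sum(1 for c, ok in zip(confidences, labels) if c >= threshold and not ok)
    return correct, erroneous


def sweep(confidences, labels, thresholds):
    """One (threshold, correct_shown, erroneous_shown) point per threshold."""
    return [(t, *shown_counts(confidences, labels, t)) for t in thresholds]


if __name__ == "__main__":
    # Toy data: one confidence score and one passed/failed label per requirement.
    confidences = [0.9, 0.8, 0.4, 0.3]
    labels = [True, False, True, False]
    for point in sweep(confidences, labels, [0.0, 0.5, 0.85]):
        print(point)
```

A threshold of 0.0 corresponds to showing every program indiscriminately (the stars in Figure 5); raising the threshold trades fewer erroneous programs shown against fewer correct programs shown.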
