Predicting Program Correctness By Ensemble Semantic Entropy

Yunxiang Wei, Tianlin Li, Yuwei Zheng, Yanni Dong, Aishan Liu, Qiang Hu, Xiaoyu Zhang, Mingfei Cheng, Jian Yang

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in generating programs from natural language descriptions, yet ensuring their correctness without an external oracle remains a critical challenge. To address this challenge, existing methods often rely on uncertainty estimation, measuring the consistency of semantics or execution behaviors across multiple samples generated by a single model. However, we observe that a single model can often converge to a consistent but incorrect solution, rendering such consistency-based proxies ineffective. Motivated by this observation, we propose Ensemble Semantic Entropy (ESE), which estimates uncertainty by evaluating the consistency of samples aggregated across an ensemble of models. Experiments on LiveCodeBench demonstrate that ESE correlates more strongly with program correctness than single-model semantic entropy. Notably, in selective generation tasks with strict false-positive rate constraints, ESE improves prediction accuracy by 53.4%. Furthermore, by leveraging ESE as the decision signal, we propose Cas, a cascading test-time scaling framework that maintains performance while reducing FLOPs by 64.9% compared to single-model scaling, offering a new perspective on balancing parameter and inference scaling.
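
To make the estimator concrete, the sketch below computes semantic entropy over a pool of samples drawn from several models. It is a minimal illustration rather than the paper's exact procedure: the semantic signature (clustering programs by their outputs on a set of probe inputs) and the `run` execution callable are assumptions standing in for whatever semantic-equivalence test and execution harness are actually used.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Sequence

def semantic_signature(program: str,
                       probe_inputs: Sequence,
                       run: Callable[[str, object], object]) -> tuple:
    """Hashable proxy for a program's semantics: its outputs on probe inputs.
    Programs with identical signatures land in the same semantic cluster."""
    return tuple(run(program, x) for x in probe_inputs)

def ensemble_semantic_entropy(samples_per_model: Iterable[Sequence[str]],
                              probe_inputs: Sequence,
                              run: Callable[[str, object], object]) -> float:
    """Pool samples across all models, cluster them by semantic signature,
    and return the Shannon entropy (in nats) of the cluster-size distribution."""
    pooled = [p for samples in samples_per_model for p in samples]
    clusters = Counter(semantic_signature(p, probe_inputs, run) for p in pooled)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())
```

Under this framing, a cascading scheme such as Cas reduces to a thresholded gate: accept the cheap model's answer when the pooled entropy is low, and escalate to a larger model otherwise. The threshold itself would have to be calibrated, for example against a target false-positive rate.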

Paper Structure

This paper contains 24 sections, 18 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Calculation of Ensemble Semantic Entropy on a motivating example (a worked version of this computation follows the figure list). The problem requires traversing an array nums backwards to collect the integers from $1$ to $k$ in the fewest steps. Qwen3-8B generates three programs that share the same incorrect semantics (counting only numbers within $[1, k]$), resulting in zero semantic entropy. GLM4-9B, however, generates programs that incorrectly output indices in forward order. By aggregating these five programs, the ensemble reveals the semantic disagreement, yielding high uncertainty that correctly flags the error and avoids the false positive.
  • Figure 2: Comparison of the distributions of the largest cluster sizes on incorrectly solved problems. Single models (GLM4-9B, Qwen3-8B) frequently show high consistency on incorrect answers, whereas the ensemble substantially reduces this spurious consistency.
  • Figure 3: Pearson correlation coefficients between uncertainty and mean program correctness under different semantic clustering methods. For each method, the left bar denotes SE and the right bar denotes ESE. Results are reported for four representative backbones.
  • Figure 4: Accuracy-cost comparison on LiveCodeBench across four test-time scaling methods (Majority Voting, S*, Cas w/o ensemble, and Cas) over five model configurations. Pass@20 is shown as the corresponding selection upper bound for each model. Methods in the upper-left region exhibit better accuracy-cost trade-offs.
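
For the Figure 1 scenario, the arithmetic is short: pooling the five samples yields two semantic clusters of sizes 3 (Qwen3-8B) and 2 (GLM4-9B). A hedged illustration, with placeholder cluster labels standing in for the actual program semantics:

```python
import math
from collections import Counter

# Figure 1 scenario: 3 Qwen3-8B samples share one (incorrect) semantics,
# 2 GLM4-9B samples share another. The labels below are illustrative only.
clusters = Counter({"count-in-range": 3, "forward-indices": 2})
total = sum(clusters.values())
ese = -sum((c / total) * math.log(c / total) for c in clusters.values())
print(round(ese, 3))  # 0.673 nats
```

Each single model's samples form one cluster on their own, so its semantic entropy is $-1 \cdot \log 1 = 0$ and the wrong answer would pass unflagged; only the pooled ensemble exposes the disagreement.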