Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation

Yihan Dai, Sijie Liang, Haotian Xu, Peichu Xie, Sergey Mechtaev

TL;DR

This work tackles hallucinations in LLM-generated code by introducing semantic triangulation, which uses a non-semantics-preserving problem transformation $\tau$ and a hyperproperty $\phi$ to create cross-task consistency checks across transformed problem instances. The approach replaces single-solution plurality with a bijective mapping of solution classes and cascaded inverse/enumerator subproblems (e.g., $\text{FWD-INV}$, $\text{FWD-SINV}$, $\text{ENUM-SINV}$) implemented in just-tri-it, and it is supported by theory under a stochastic-parrot model with correlated errors using the rearrangement inequality. Empirically, semantic triangulation yields substantial gains on LiveCodeBench and CodeElo over baselines, enables reliable abstention, and handles inexact problems with multiple non-equivalent solutions, demonstrating practical benefits for automated code generation. The results show that exploiting cross-task consistency via bijective error mapping offers a robust, model-agnostic pathway to reduce code-generation hallucinations in black-box LLM settings.

Abstract

When generating code from natural language prompts, an LLM samples programs from a probability distribution, many of which might be incorrect. Sample consensus techniques, such as majority voting or validation against generated tests or specifications, aim to identify a correct program in the sample or abstain if none is valid. However, existing methods often fail to select a correct solution when its sampling probability is low, or when the problem permits multiple valid but non-equivalent solutions. Additionally, they often fail to abstain when no correct solution is present in the sample. To overcome these limitations, we introduce semantic triangulation, which transforms a programming problem in a way that non-trivially alters its semantics while preserving an exact, verifiable mapping between solutions before and after transformation. We theoretically establish that verifying consistency across such problem transformations increases confidence that generated programs reflect accurate generalization rather than spurious statistical correlations, enabling more reliable sample consensus and abstention. On the LiveCodeBench and CodeElo benchmarks, using GPT-4o and DeepSeek-V3 models, semantic triangulation increases the reliability of generated code by 21% compared to the method that selects only high-confidence solutions with a probability threshold of 0.5, while being able to pinpoint correct solutions at sampling probabilities as low as 0.14. It is also the only approach that consistently forms true consensus on tasks with multiple valid but non-equivalent solutions.
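
To make the contrast between plurality and triangulation concrete, here is a minimal toy sketch, not the just-tri-it implementation. It assumes a hypothetical forward problem (inclusive prefix sums), a hypothetical transformed problem (recovering the list from its prefix sums), and a round-trip check `inv(fwd(x)) == x` standing in for the hyperproperty $\phi$. All function names, the hand-written "samples", and the first-match selection rule are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical forward candidates (stand-ins for programs sampled for d).
def fwd_correct(xs):
    return list(accumulate(xs))                      # inclusive prefix sums

def fwd_exclusive(xs):
    # Dominant bug: exclusive prefix sums, which lose the last element.
    return [0] + list(accumulate(xs))[:-1] if xs else []

def fwd_total(xs):
    return [sum(xs)] * len(xs)                       # another bug

# Hypothetical inverse candidates (stand-ins for programs sampled for tau(d)).
def inv_correct(ps):
    return ps[:1] + [b - a for a, b in zip(ps, ps[1:])]

def inv_drop_first(ps):
    return [b - a for a, b in zip(ps, ps[1:])]       # bug: forgets ps[0]

def behaviour(prog, inputs):
    """Observable behaviour of a program on the test inputs."""
    return tuple(tuple(prog(list(x))) for x in inputs)

def plurality(sample, inputs):
    """Pick the largest behavioural class and report its share of the sample."""
    classes = Counter(behaviour(p, inputs) for p in sample)
    top, count = classes.most_common(1)[0]
    winner = next(p for p in sample if behaviour(p, inputs) == top)
    return winner, count / len(sample)

def triangulate(fwd_sample, inv_sample, inputs):
    """Return the first (forward, inverse) pair that round-trips on every
    input, or None to abstain. A simplification of the paper's selection."""
    for f in fwd_sample:
        for g in inv_sample:
            if all(g(f(list(x))) == list(x) for x in inputs):
                return f, g
    return None

inputs = [[3, 1, 2], [5], [0, 0, 7], [-2, 4]]
fwd_sample = [fwd_exclusive] * 3 + [fwd_correct, fwd_total]
inv_sample = [inv_drop_first, inv_correct, inv_correct]

voted, share = plurality(fwd_sample, inputs)
pair = triangulate(fwd_sample, inv_sample, inputs)
abstained = triangulate([fwd_exclusive, fwd_total], inv_sample, inputs)
```

In this toy sample, plurality selects the dominant buggy program even under a 0.5 threshold, while the round-trip check selects the minority correct program and returns `None` when no correct forward program is present. The buggy forward programs cannot be matched by any inverse here because they discard information about the input, which is only one of the ways a cross-task check can expose an error.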

Paper Structure

This paper contains 29 sections, 11 theorems, 23 equations, 15 figures.

Key Result

proposition 1

Let a code generation model $m$ be a stochastic parrot with correlated errors that hallucinates on problems with equal probability of correct solutions. There exists semantic triangulation $(\tau, \phi)$ such that for $p, q\sim m(\,\cdot\mid d)$ and $q'\sim m(\,\cdot\mid \tau(d)),$

Figures (15)

  • Figure 1: The intuition and structure of semantic triangulation. The notation $p \sim m(\,\cdot \mid d)$ means the program $p$ is sampled from a conditional distribution induced by an LLM $m$ given a problem description $d$.
  • Figure 2: Probabilities of sampling correct solutions from GPT‑4o for LiveCodeBench‑v6 problems (estimated over 100 trials).
  • Figure 3: just-tri-it selects a correct solution with the probability 0.07 in the presence of a dominant error with the probability 0.23 in the GPT-4o distribution, by applying cascading triangulation of an answer enumerator against a set-valued partial inverse, and a target forward solution against a triangulated enumerator.
  • Figure 4: Plurality suffers from correlated errors. Triangulation rearranges the mapping of programs to their witnesses so that large classes of program errors are matched with small classes of witness errors and vice versa, which decreases the probability of matching bugs as per the rearrangement inequality.
  • Figure 5: A fragment of the prompt template used for the set-valued inverse problem transformation. ORIG_SIGN is the typed signature of the original function, INV_SIGN is the signature of the inverse function, NEW_ARG is the argument of the inverse function representing the output of the original function, and INV_ARG is the argument to invert, which is chosen by an LLM using a separate prompt.
  • ...and 10 more figures
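
The rearrangement argument in the Figure 4 caption can be illustrated numerically. The sketch below assumes three error classes with made-up probabilities, and assumes the program and its witness are sampled independently, so that the chance of a matching bug is the sum of products of paired class probabilities. The numbers and names are hypothetical and are not taken from the paper.

```python
from itertools import permutations

# Made-up probabilities of three error classes, sorted from large to small.
program_errors = [0.30, 0.10, 0.05]   # classes of buggy programs
witness_errors = [0.30, 0.10, 0.05]   # classes of buggy witnesses

def match_probability(ps, ws):
    """Chance that a program bug meets its paired witness bug, assuming
    the two are sampled independently and class i is paired with class i."""
    return sum(p * w for p, w in zip(ps, ws))

# Correlated errors: large program classes are paired with large witness classes.
aligned = match_probability(program_errors, witness_errors)

# Rearranged pairing: large program classes are paired with small witness classes.
rearranged = match_probability(program_errors, witness_errors[::-1])

# Every possible pairing, to compare against the two extremes.
all_pairings = [match_probability(program_errors, perm)
                for perm in permutations(witness_errors)]
```

By the rearrangement inequality, `aligned` is the largest and `rearranged` the smallest value in `all_pairings`, which is the sense in which pairing large classes with small ones lowers the probability of matching bugs.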

Theorems & Definitions (32)

  • definition 1: Semantic Triangulation
  • definition 2: Semantic Program Equivalence
  • definition 3: Exactness
  • definition 4: Program Correctness w.r.t. Problem Description
  • definition 5: Confidence-Enhancing Plausibility Witness Problem
  • proposition 1
  • proof
  • proposition 2
  • proof
  • proposition 3: Triangulation Generalizes Plurality
  • ...and 22 more