Table of Contents
Fetching ...

Disproving Program Equivalence with LLMs

Miltiadis Allamanis, Pengcheng Yin

TL;DR

ProbeGen presents a white-box, execution-feedback-driven method to disprove functional equivalence among code implementations by generating usage probes with LLMs. It complements unit tests by exposing semantic differences that unit tests often miss, and it enables semantic clustering and self-consistency analyses to improve code synthesis evaluation. Across the LBPP code-synthesis benchmark, ProbeGen finds significant semantic differences in samples deemed equivalent by unit tests (about 18–24%), and deeper tree-search strategies can boost discovery efficiency with manageable cost. The results suggest ProbeGen is a valuable tool for evaluating and debugging LLM-generated code, particularly when unit tests are weak or unavailable, and for enabling more interpretable, semantically aware analyses of code proposals.

Abstract

To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.

Disproving Program Equivalence with LLMs

TL;DR

ProbeGen presents a white-box, execution-feedback-driven method to disprove functional equivalence among code implementations by generating usage probes with LLMs. It complements unit tests by exposing semantic differences that unit tests often miss, and it enables semantic clustering and self-consistency analyses to improve code synthesis evaluation. Across the LBPP code-synthesis benchmark, ProbeGen finds significant semantic differences in samples deemed equivalent by unit tests (about 18–24%), and deeper tree-search strategies can boost discovery efficiency with manageable cost. The results suggest ProbeGen is a valuable tool for evaluating and debugging LLM-generated code, particularly when unit tests are weak or unavailable, and for enabling more interpretable, semantically aware analyses of code proposals.

Abstract

To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.

Paper Structure

This paper contains 29 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Given a task description, an interface (function in this example) and a number of implementations of the interface, an LLM is asked to generate code (probe) that uses the interface and causes the implementations to diverge in behavior (in this example, return different output lists). This provides a counterexample where the implementations differ (as proven by their execution). However, the -- often implicit -- pre- or postconditions of the interface may be ambiguous. A final step filters spurious counterexamples and yields a partition of the implementations (colored in orange and green) into non-equivalent groups.
  • Figure 2: Sample probe vs. unit test for a fictional class interface . A probe only provides inputs in a usage example, whereas a unit test needs to make concrete assertions about the expected values. This makes probes easier to generate and allows us to perform more accurate testing for equivalence.
  • Figure 3: ProbeGen as a tree search process. At each step $k$ independent samples may be drawn, up to a maximum depth $D$. When a probe successfully disproves the equivalence (green nodes) it forms a leaf node.
  • Figure 4: Relationship between the branching factor $k$ and search depth to the probability $S$ of successfully generating a probe that exposes a counterexample and the trade-offs with the inference cost. The two contour plots (left) show the probability of success (lighter color is higher) with respect to $K$ and depth for two search strategies (plots for all strategies can be found at \ref{['fig:search trees all']}). The lines in the contour plots trace (from the bottom left) the Pareto-optimal trade-off between inference cost and probability of success $S$. \ref{['fig:strategy comparison']} plots the best probability of success $S$ for a given inference cost for each search strategy.
  • Figure 5: Semantic Clustering through Probe Generation. Given a number of implementation $f_1, ...$ of an interface $\mathbb{I}$ we can generate probes that partition the set of implementations based on their outputs, clustering the implementations based on their semantics.
  • ...and 1 more figures