Table of Contents
Fetching ...

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

TL;DR

EquiBench introduces a fully automated, large-scale benchmark for evaluating LLMs on semantic reasoning about code via equivalence checking. By constructing 2400 program pairs across Python, C, CUDA, and x86-64 through diverse automated transformations (DCE, compiler scheduling, superoptimization, and contest problems), it probes deep program semantics rather than surface syntax. Evaluations across 19 models show that even state-of-the-art models struggle in challenging categories, with few-shot prompts and Chain-of-Thought prompting offering limited gains, and biases toward syntactic similarity observed. The work highlights fundamental limits in current semantic understanding of programs by LLMs and suggests directions for training and benchmark design to advance robust program reasoning. EquiBench thus provides a rigorous, extensible framework for diagnosing and guiding future improvements in semantically aware code reasoning systems.

Abstract

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

TL;DR

EquiBench introduces a fully automated, large-scale benchmark for evaluating LLMs on semantic reasoning about code via equivalence checking. By constructing 2400 program pairs across Python, C, CUDA, and x86-64 through diverse automated transformations (DCE, compiler scheduling, superoptimization, and contest problems), it probes deep program semantics rather than surface syntax. Evaluations across 19 models show that even state-of-the-art models struggle in challenging categories, with few-shot prompts and Chain-of-Thought prompting offering limited gains, and biases toward syntactic similarity observed. The work highlights fundamental limits in current semantic understanding of programs by LLMs and suggests directions for training and benchmark design to advance robust program reasoning. EquiBench thus provides a rigorous, extensible framework for diagnosing and guiding future improvements in semantically aware code reasoning systems.

Abstract

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
Paper Structure (57 sections, 7 figures, 4 tables)

This paper contains 57 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of EquiBench. We construct (in)equivalent program pairs from diverse sources, including C and CUDA programs, x86-64 assembly, and competitive programming, using automated transformations based on program analysis, compiler scheduling, superoptimization, and changes in algorithms or variable names.
  • Figure 2: An equivalent pair from the DCE category in EquiBench. In the left program, c = 1 is dead code that has no effect on the program state and is removed in the right program. Such pairs are generated using the Dead Code Elimination (DCE) pass in compilers.
  • Figure 3: An equivalent pair from the CUDA category in EquiBench. Both programs perform matrix-vector multiplication ($y = Ax$). The right-hand program uses shared memory tiling to improve performance. Tensor compilers are utilized to explore different scheduling strategies, automating the generation.
  • Figure 4: An equivalent pair from the x86-64 category in EquiBench. Both programs are compiled from the same C function shown above, the left using a compiler and the right using a superoptimizer. The function counts the number of set bits in the input %rdi register and stores the result in %rax. Their equivalence has been formally verified by the superoptimizer.
  • Figure 5: Equivalent pairs from the OJ_A, OJ_V, OJ_VA categories in EquiBench. OJ_A pairs demonstrate algorithmic equivalence, OJ_V pairs involve variable renaming transformations, and OJ_VA pairs combine both types of variations.
  • ...and 2 more figures