EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken
TL;DR
EquiBench introduces a fully automated, large-scale benchmark for evaluating LLMs on semantic reasoning about code via equivalence checking. By constructing 2400 program pairs across Python, C, CUDA, and x86-64 through diverse automated transformations (DCE, compiler scheduling, superoptimization, and contest problems), it probes deep program semantics rather than surface syntax. Evaluations across 19 models show that even state-of-the-art models struggle in challenging categories, with few-shot prompts and Chain-of-Thought prompting offering limited gains, and biases toward syntactic similarity observed. The work highlights fundamental limits in current semantic understanding of programs by LLMs and suggests directions for training and benchmark design to advance robust program reasoning. EquiBench thus provides a rigorous, extensible framework for diagnosing and guiding future improvements in semantically aware code reasoning systems.
Abstract
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
