Evaluating Program Semantics Reasoning with Type Inference in System F

Yifeng He; Luning Yang; Christopher Castro Gaw Gonzalo; Hao Chen

Evaluating Program Semantics Reasoning with Type Inference in System F

Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen

TL;DR

This work introduces TF-Bench, a formal benchmark that measures LLMs' ability to reason about program semantics through type inference in System $F$, thereby isolating deductive semantics from natural-language cues. It pairs TF-Bench with TF-Bench$_{\text{pure}}$, a NL-free variant with 188 tasks, and evaluates 64 LLMs using a rigorous alpha-equivalence based evaluation, complemented by two new metrics: semantic robustness and reasoning effectiveness. The study reveals substantial gaps in current models, with the best TF-Bench$_{\text{pure}}$ performance around $55.85\%$, and demonstrates that TTC-based reasoning can improve or mislead depending on the model and data used for fine-tuning. The results underscore the need for program-centric evaluation in LLMs and offer concrete guidance for designing future reasoning-focused training and evaluation protocols.

Abstract

Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.

Evaluating Program Semantics Reasoning with Type Inference in System F

TL;DR

Abstract

Evaluating Program Semantics Reasoning with Type Inference in System F

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (11)