Agentic Code Reasoning

Shubham Ugare, Satish Chandra

TL;DR

Structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.

Abstract

Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench (Mohammad et al., 2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J (Just et al., 2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.
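The three required parts of a semi-formal argument (explicit premises, traced execution paths, and a derived conclusion) can be pictured as a record that is rejected while any part is empty. The sketch below is only an illustration of that idea: the `Certificate` class and its field names are hypothetical and are not taken from the paper's prompt templates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Certificate:
    """Hypothetical container for a semi-formal argument (names are assumptions)."""

    premises: List[str] = field(default_factory=list)  # facts read from the code
    traces: List[str] = field(default_factory=list)    # execution paths followed
    conclusion: str = ""                                # claim derived from the above

    def missing_parts(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        missing = []
        if not self.premises:
            missing.append("premises")
        if not self.traces:
            missing.append("traces")
        if not self.conclusion.strip():
            missing.append("conclusion")
        return missing

    def is_complete(self) -> bool:
        """A certificate counts as complete only when no section is empty."""
        return not self.missing_parts()


# Usage: a conclusion without a supporting trace is flagged as incomplete.
draft = Certificate(premises=["P1: format is defined at module level"],
                    conclusion="Patches are not equivalent")
print(draft.missing_parts())
```

A check like this only enforces that every section is present; whether the premises and traces actually support the conclusion still has to be judged separately.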

Paper Structure (49 sections, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Motivating example from Django (django-13670): Standard reasoning assumes format() is Python's builtin; semi-formal analysis traces the actual definition and discovers a module-level function with different semantics, causing Patch 1 to raise AttributeError. See Appendix \ref{app:mockito8} for a detailed semi-formal proof walkthrough.
  • Figure 2: Condensed semi-formal certificate template for patch equivalence. The agent must fill in every bracketed field with evidence gathered from the codebase. The full template includes additional fields for pass-to-pass tests, confidence levels, and alternative hypothesis checks. Templates for other tasks follow the same structure but with task-specific premises and claims. The full prompt appears in Appendix \ref{app:patch-equiv}.
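
The name-shadowing pitfall described in the Figure 1 caption can be reproduced in a few lines. The sketch below is a simplified stand-in, not the Django source: `DateFormatter`, `two_digit_year_shadowed`, and `two_digit_year_builtin` are hypothetical names, and the module-level `format` here only mimics the general shape of a function that expects a date-like object.

```python
import builtins


class DateFormatter:
    """Hypothetical stand-in for a class that formats date-like objects."""

    def __init__(self, value):
        self.value = value

    def format(self, spec):
        # Assumes `value` is date-like; a plain int has no `.year` attribute.
        return spec.replace("Y", str(self.value.year))


def format(value, spec):
    """Module-level function that shadows the builtin `format` in this module."""
    return DateFormatter(value).format(spec)


def two_digit_year_shadowed(year):
    # Looks like a call to the builtin, but resolves to the function above.
    return format(year, "04d")[-2:]


def two_digit_year_builtin(year):
    # Naming the builtin explicitly avoids the shadowed definition.
    return builtins.format(year % 100, "02d")


print(two_digit_year_builtin(1999))
try:
    two_digit_year_shadowed(1999)
except AttributeError as exc:
    print("AttributeError:", exc)
```

Reading `two_digit_year_shadowed` in isolation suggests it works; only tracing `format` to its definition in the same module shows that it raises, which is the kind of step the caption credits to semi-formal analysis.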

Theorems & Definitions (1)

  • Definition 1: Patch Equivalence Modulo Tests