Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation
Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati
TL;DR
How LLMs reason about runtime behavior when simulating code execution remains underexplored. The authors conduct a four-step empirical study on a benchmark of 427 snippets drawn from HumanEval Plus and LiveCodeBench, each tested with regular, edge, and invalid inputs, evaluating four reasoning-oriented LLMs and collecting their explicit reasoning traces. They develop a two-level error annotation scheme and a nine-category taxonomy of reasoning failures, and explore tool-augmented reasoning, which corrects about 58% of Computation Errors. The study characterizes recurring error modes in LLM reasoning and shows practical gains from external tooling; data and code are released to support reproducibility.
Abstract
Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language models (LLMs) can accurately predict program outputs, most prior work has focused on output accuracy and performance, treating reasoning as a black box. As a result, little is known about the structure or failure modes of their reasoning traces. To address this gap, we conduct the first empirical study on runtime behavior inference with reasoning LLMs, aiming to uncover and characterize errors in their reasoning traces. We curate a benchmark from HumanEval Plus and LiveCodeBench, containing 427 code snippets. For each snippet, we test three input types: regular, edge, and invalid. Twelve input values are selected per snippet, each paired with its ground-truth execution result. We evaluate four state-of-the-art reasoning LLMs. Our results show that these models reach accuracies between 85 percent and 98 percent across input types. We also analyze the produced reasoning traces and develop a taxonomy with nine categories of inference errors. Finally, we explore tool-augmented reasoning. Using failures in the Computation Errors category as a case study, our experiments show that this approach corrects 58 percent of such errors, demonstrating the potential of tool support for improving LLM reasoning.
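As a rough illustration of what pairing each input with its ground-truth execution result could look like, the sketch below runs a toy function on one regular, one edge, and one invalid input and records either the return value or the exception type. The helper `record_ground_truth`, the toy function `mean`, and the record format are hypothetical and chosen only for illustration; they are not taken from the authors' released code.

```python
from typing import Any, Callable


def record_ground_truth(func: Callable[..., Any], args: tuple) -> dict:
    """Run func on args and record its return value or the raised exception type.

    Hypothetical helper: the record layout is an assumption, not the paper's format.
    """
    try:
        return {"status": "ok", "result": func(*args)}
    except Exception as exc:
        # For invalid or degenerate inputs, the exception type is the ground truth.
        return {"status": "error", "result": type(exc).__name__}


def mean(values):
    """Toy snippet under test (illustrative only)."""
    return sum(values) / len(values)


# One example per input type named in the abstract.
inputs = {
    "regular": ([1, 2, 3],),
    "edge": ([],),        # empty list: division by zero
    "invalid": ("abc",),  # wrong type: sum() cannot add int and str
}

ground_truth = {
    kind: record_ground_truth(mean, args) for kind, args in inputs.items()
}

for kind, record in ground_truth.items():
    print(kind, record)
```

A model's predicted result for each input can then be compared against the corresponding record, with exception types treated as valid expected outcomes for invalid inputs.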
