TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He

TL;DR

TaTToo tackles the gap in reward supervision for tabular reasoning by introducing a table-grounded PRM with tool integration that explicitly verifies table retrieval and schema interactions. It builds a large-scale dataset (~60,000 instances) of verification rationales and learns tool-enabled reasoning through a dual-stage process: supervised fine-tuning to capture tool-use patterns and reinforcement learning with tool-grounded reward shaping. Across five tabular benchmarks, TaTToo achieves substantial gains with 8B parameters and generalizes to diverse test-time strategies, outperforming larger PRMs while maintaining efficiency. This work demonstrates the value of table-aware supervision and external tool integration for scalable, accurate tabular reasoning in large language models.

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
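To make the idea of tool-based step verification concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy table stored as a list of row dictionaries and a reasoning step that claims an aggregate over a retrieved sub-table. The names `TABLE`, `retrieve`, `tool_sum`, and `verify_step`, as well as the binary 1.0/0.0 reward, are hypothetical and chosen only for illustration.

```python
# Toy table as a list of row dicts (hypothetical data, not from the paper).
TABLE = [
    {"region": "East", "year": 2022, "revenue": 120.0},
    {"region": "East", "year": 2023, "revenue": 180.0},
    {"region": "West", "year": 2022, "revenue": 90.0},
    {"region": "West", "year": 2023, "revenue": 110.0},
]


def retrieve(table, predicate):
    """Sub-table retrieval: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def tool_sum(rows, column):
    """Tool call: recompute an aggregate directly from the table rows."""
    return sum(row[column] for row in rows)


def verify_step(table, predicate, column, claimed_value, tol=1e-6):
    """Return 1.0 if the step's claimed value matches the tool result, else 0.0.

    The binary reward is an assumption made for this sketch.
    """
    actual = tool_sum(retrieve(table, predicate), column)
    return 1.0 if abs(actual - claimed_value) <= tol else 0.0


def is_east(row):
    return row["region"] == "East"


# A step claiming "East revenue totals 300" is confirmed by the tool;
# a step claiming 290 is not.
good_step_reward = verify_step(TABLE, is_east, "revenue", 300.0)
bad_step_reward = verify_step(TABLE, is_east, "revenue", 290.0)
print(good_step_reward, bad_step_reward)  # 1.0 0.0
```

The point of the sketch is that the reward for a table-dependent step comes from re-executing the operation against the table rather than from judging the step's text alone.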

Paper Structure

This paper contains 43 sections, 4 theorems, 21 equations, 9 figures, and 5 tables.

Key Result

Theorem 4.1

Given the current policy $\pi$, after one natural policy gradient update step guided by the PRM reward $r_i$ defined in Eq. \ref{eqa:prm_reward}, we obtain the revised policy $\pi'(a_i \mid \mathbf{s}_i) \propto \exp(Q^{\pi}(\mathbf{s}_i, a_i) + r_i(\mathbf{s}_i, a_i))$. The resulting expected policy improvement ... where $A^{\pi}(\mathbf{s}_i, a_i) = Q^\pi(\mathbf{s}_i, a_i) - V^{\pi}(\mathbf{s}_i)$ denotes the ...
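The update form in the theorem, $\pi'(a_i \mid \mathbf{s}_i) \propto \exp(Q^{\pi}(\mathbf{s}_i, a_i) + r_i(\mathbf{s}_i, a_i))$, can be illustrated numerically for a single state with a handful of discrete actions. The sketch below is only an illustration under stated assumptions: the Q-values, PRM rewards, and uniform current policy are made-up numbers, and the helper names `npg_update` and `advantages` are hypothetical.

```python
import math


def npg_update(q_values, prm_rewards):
    """Revised policy pi'(a|s) proportional to exp(Q(s,a) + r(s,a)) for one state."""
    logits = [q + r for q, r in zip(q_values, prm_rewards)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def advantages(q_values, policy):
    """A(s,a) = Q(s,a) - V(s), with V(s) the expectation of Q under the policy."""
    value = sum(p * q for p, q in zip(policy, q_values))
    return [q - value for q in q_values]


# Made-up numbers for three candidate actions in one state.
q = [0.2, 0.5, 0.1]
r = [1.0, 0.0, -1.0]          # assumed PRM rewards
current_policy = [1 / 3] * 3  # assumed uniform current policy

new_policy = npg_update(q, r)
no_reward_policy = npg_update(q, [0.0, 0.0, 0.0])
adv = advantages(q, current_policy)

print([round(p, 3) for p in new_policy])
print([round(p, 3) for p in no_reward_policy])
print([round(a, 3) for a in adv])
```

With these numbers the PRM reward shifts the most probable action from the second (highest Q-value) to the first (highest Q-value plus reward), which is the mechanism through which the PRM influences the policy in the theorem's setting.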

Figures (9)

  • Figure 1: Best-of-N performance of DeepSeek-R1-Distill-Qwen-14B across 3 table tasks on TableBench with different types of step verifiers.
  • Figure 2: Error Distribution over 4 step categories across 500 incorrect cases after Best-of-N selection.
  • Figure 3: Left: PRM's rewards on 500 reasoning steps with the real-retrieved/randomly-replaced sub-table. Middle: Layer-wise average attention mass vs. relative step distance in tabular reasoning. Attention concentrates on nearby steps, with sharp decay as distance increases. Right: Best-of-N results on DeepSeek-R1-Distill-Qwen-14B for numerical reasoning with/without the table prefix.
  • Figure 4: Overview of TaTToo framework. We first curate 60k high-quality instances by collecting expert verification rationales with tool integration (Section \ref{sec:data_curation}). We then train our PRM through a dual-stage training paradigm to achieve tool-grounded step-by-step reward supervision (Section \ref{sec:training_PRM}).
  • Figure 5: Performance of TaTToo on two additional TTS strategies, Beam Search and Diverse Verifier Tree Search (DVTS). We report the average accuracy across all 5 tabular reasoning tasks.
  • ...and 4 more figures
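
Several of the figures above report Best-of-N results, where a step verifier scores N sampled responses and the highest-scoring one is kept. A minimal sketch of that selection is shown below. The candidate data is invented, and the aggregation of step rewards into a response score (minimum by default, with last-step as an alternative) is an assumption of this sketch rather than a detail confirmed by the text; `score_response` and `best_of_n` are hypothetical names.

```python
def score_response(step_rewards, aggregate=min):
    """Collapse per-step PRM rewards into one response-level score."""
    return aggregate(step_rewards)


def best_of_n(candidates, aggregate=min):
    """Pick the candidate whose aggregated step reward is highest."""
    return max(
        candidates,
        key=lambda cand: score_response(cand["step_rewards"], aggregate),
    )


def last_step(step_rewards):
    """Alternative aggregation: use only the final step's reward."""
    return step_rewards[-1]


# Invented candidates: each has a final answer and per-step rewards.
candidates = [
    {"answer": "A", "step_rewards": [0.9, 0.2, 0.95]},
    {"answer": "B", "step_rewards": [0.8, 0.7, 0.75]},
    {"answer": "C", "step_rewards": [0.6, 0.65, 0.5]},
]

chosen_min = best_of_n(candidates)              # penalizes any weak step
chosen_last = best_of_n(candidates, last_step)  # looks at the final step only
print(chosen_min["answer"], chosen_last["answer"])  # B A
```

The two aggregations disagree here because candidate A contains one poorly rewarded intermediate step, which shows how the choice of aggregation affects which response a verifier selects.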

Theorems & Definitions (6)

  • Theorem 4.1: Policy Improvement (Lower Bound)
  • Lemma D.1: Performance Difference Lemma (PDL)
  • Lemma D.2: Natural policy gradient (NPG) update form
  • Proposition D.3: Full-strength policy improvement lower bound
  • Proof: Proof of Proposition D.3
  • Remark D.4