TabTracer: Monte Carlo Tree Search for Complex Table Reasoning with Large Language Models

Zhizhao Luo, Zhaojing Luo, Meihui Zhang, Rui Mao

TL;DR

TabTracer is an agentic framework that coordinates multi-step tool calls over intermediate table states, with explicit state tracking for verification and rollback, and reduces redundancy with budget-aware pruning, deduplication, and state hashing with a monotonicity gate to cut token cost.

Abstract

Large language models (LLMs) have emerged as powerful tools for natural language table reasoning, and existing methods fall into two main categories. Prompt-based approaches rely on language-only inference or one-pass program generation without step-level verification. Agent-based approaches use tools in a closed loop, but verification is often local and backtracking is limited, allowing errors to propagate and increasing cost. Moreover, they rely on chain- or beam-style trajectories that are typically combinatorially redundant, leading to high token costs. In this paper, we propose TabTracer, an agentic framework that coordinates multi-step tool calls over intermediate table states, with explicit state tracking for verification and rollback. First, it enforces step-level verification with typed operations and lightweight numeric and format checks to provide reliable rewards and suppress hallucinations. Second, execution-feedback Monte Carlo Tree Search maintains a search tree of candidate table states and uses backpropagated reflection scores to guide UCB1 selection and rollback via versioned snapshots. Third, it reduces redundancy with budget-aware pruning, deduplication, and state hashing with a monotonicity gate to cut token cost. Comprehensive evaluation on the TabFact, WikiTQ, and CRT datasets shows that TabTracer outperforms state-of-the-art baselines by up to 6.7% in accuracy while reducing token consumption by 59% to 84%.
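The abstract names UCB1 selection, backpropagated scores, and state hashing for deduplication without showing how they fit together. The sketch below is a minimal, generic illustration of those three ideas and not the authors' implementation: the `Node` class, the `state_hash`, `add_child`, `select_child`, and `backpropagate` helpers, the list-of-rows table representation, and the exploration constant `c = sqrt(2)` are all assumptions made for this example. The budget-aware pruning and monotonicity gate are not modelled here.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate table state in the search tree (hypothetical structure)."""
    state_key: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def state_hash(rows):
    """Hash a table given as a sequence of rows, so equal states share a key."""
    canonical = repr([tuple(r) for r in rows]).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def add_child(parent, rows, seen):
    """Attach a new table state unless an identical one was already explored."""
    key = state_hash(rows)
    if key in seen:
        return None  # duplicate state: skip it instead of spending budget again
    seen.add(key)
    child = Node(state_key=key, parent=parent)
    parent.children.append(child)
    return child


def ucb1(child, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(node):
    """Pick the child with the highest UCB1 score."""
    return max(node.children, key=lambda ch: ucb1(ch, node.visits))


def backpropagate(node, reward):
    """Add a reward (e.g. a reflection score) to every node up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

In use, a caller would keep one `seen` set per search, call `add_child` for each table produced by a tool call, score the result, and pass that score to `backpropagate` before the next `select_child` call. Rolling back then amounts to selecting a different node whose stored table snapshot is still available.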

Paper Structure (34 sections, 22 equations, 6 figures, 5 tables, 4 algorithms)

Figures (6)

  • Figure 1: Prompt-based and agent-based outputs fail to complete the aggregation, while TabTracer (our approach) slices the table to count songs per date and aggregate by month (Nov=9 vs Jan=3).
  • Figure 2: The reasoning layer includes planning and reflection, the execution layer issues atomic dataframe tools, and the versioned storage layer preserves snapshots for fallback and retry.
  • Figure 3: Tool ablation on CRT. We report exact-match accuracy and token usage when removing each operator.
  • Figure 4: Tool success rate (SR) and adoption rate (AR) across stages.
  • Figure 5: Average simulations and state reuse rate on WikiTQ and CRT.
  • ...and 1 more figure