TABQAWORLD: Optimizing Multimodal Reasoning for Multi-Turn Table Question Answering

Tung Sum Thomas Kwok, Xinyu Wang, Xiaofeng Lin, Peng Lu, Chunhe Wang, Changlun Li, Hanwei Wu, Nan Tang, Elisa Kreiss, Guang Cheng

Abstract

Multimodal reasoning has emerged as a powerful framework for enhancing the capabilities of reasoning models. While multi-turn table reasoning methods have improved accuracy through tool use and reward modeling, they rely on fixed text serialization for table state readouts. This introduces representation errors in table encoding that accumulate significantly over multiple turns. Tabular grounding methods alleviate this accumulation, but at the expense of inference compute and cost, rendering real-world deployment impractical. To address this, we introduce TABQAWORLD, a table reasoning framework that jointly optimizes tabular actions through representation and estimation. For representation, TABQAWORLD employs an action-conditioned multimodal selection policy, which dynamically switches between visual and textual representations to maximize table state readout reliability. For estimation, TABQAWORLD optimizes the stepwise reasoning trajectory using table metadata, including dimensions, data types, and key values, safely planning the trajectory and compressing low-complexity actions to reduce conversation turns and latency. TABQAWORLD is designed as a training-free framework. Empirical evaluations show that it achieves state-of-the-art performance with a 4.87% accuracy improvement over baselines, and a 5.42% accuracy gain and 33.35% inference latency reduction over static settings, establishing a new standard for reliable and efficient table reasoning.

Paper Structure

This paper contains 52 sections, 2 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: Motivation and overview of TabQAWorld. Fixed text serialization introduces state-tracking noise (representation bottleneck), which propagates across multi-step reasoning and causes trajectory drift (estimation bottleneck). TabQAWorld addresses this failure process by jointly optimizing what to see and what to expect.
  • Figure 2: An illustrative example of how image-based parsing facilitates more human-preferred column attention than text-serialized tables. The value below each table indicates the mean-squared error (MSE) against human-preferred binary attention.
  • Figure 3: Hallucinations in full table estimation from frontier GPT-5.4 motivate lower-dimensional state estimation.
  • Figure 4: TabQAWorld dynamically selects the optimal data modality based on the purpose of each task and optimizes the reasoning trajectory using low-dimensional metadata, minimizing token usage and latency while maintaining a rigorous feedback loop to ensure convergence on an accurate final answer.
  • Figure 5: Illustration of metadata-guided execution. A mismatch in key_output (1964 vs. 1968) triggers replanning.
  • ...and 4 more figures
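
The metadata check described in the Figure 5 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `TableMetadata` class, the `summarize` and `mismatched_fields` helpers, and the sample rows are all hypothetical. The only element taken from the captions is the idea that a mismatch between the expected and observed `key_output` (1964 vs. 1968) triggers replanning.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical low-dimensional summary of a table state."""
    shape: tuple[int, int]    # (rows, columns)
    dtypes: tuple[str, ...]   # one type name per column
    key_output: object        # a key value the plan expects to see


def summarize(rows, key_column):
    """Build metadata from a non-empty list-of-dicts table (illustrative only)."""
    columns = list(rows[0].keys())
    dtypes = tuple(type(rows[0][c]).__name__ for c in columns)
    return TableMetadata((len(rows), len(columns)), dtypes, rows[0][key_column])


def mismatched_fields(expected, observed):
    """Return the names of metadata fields where the estimate and result differ."""
    return [
        f.name
        for f in fields(expected)
        if getattr(expected, f.name) != getattr(observed, f.name)
    ]


# Made-up table state after an action; the estimate expected 1964 as the key value.
rows = [{"year": 1968, "city": "Mexico City"}, {"year": 1972, "city": "Munich"}]
expected = TableMetadata(shape=(2, 2), dtypes=("int", "str"), key_output=1964)
observed = summarize(rows, key_column="year")

diff = mismatched_fields(expected, observed)
needs_replan = bool(diff)  # any mismatch would trigger replanning in this sketch
print(diff, needs_replan)
```

Comparing a handful of metadata fields rather than the full table keeps the check cheap, which is in the spirit of the lower-dimensional state estimation motivated in the Figure 3 caption.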