p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models

Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich

TL;DR

This work addresses the under-utilization of training data and the lack of post-training gains in table question answering (TQA). It introduces p2-TQA, a three-stage, process-based preference learning framework that converts model-generated reasoning traces into stateful data, estimates state values via Monte Carlo rollouts, and constructs high-quality pairwise step preferences for direct optimization, all without additional manual data. Empirically, p2-TQA yields up to about $5\%$ in-domain and $2.4\%$ out-of-domain improvements using only $8{,}000$ training instances and achieves competitive results with significantly lower inference cost compared to larger state-of-the-art systems. The method demonstrates a practical, data-efficient path to self-improvement in TQA and potentially other reasoning-heavy tasks, highlighting the value of structured, process-aware post-training.

Abstract

Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p2-TQA, a process-based preference learning framework for TQA post-training. p2-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p2-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p2-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 8 tables.

Figures (9)

  • Figure 1: An overview of p2-TQA: An existing model generates reasoning chains for a given problem. The chains are parsed into states, composed of cumulative steps. Each state is scored by a value function. We then create pairwise steps by rolling out parent states, selecting those with value differences exceeding a threshold. Lastly, contrastive learning is performed over collected data to improve the TQA model.
  • Figure 2: Process-based preference data collection. We estimate a state value by the probability of a state leading to a correct answer. In the first example, $V(s_i)=\frac{2}{3}$. After obtaining state values, we do not consider intermediate states that have a value of 0 ($s_{21}$), together with their child states. We sample pair-wise states for each remaining state, e.g., $s'_{22}$ is sampled by rolling out $s_{12}$ and is regarded as a pair state for $s_{22}$.
  • Figure 3: Comparing p2-TQA with baselines using Exact Match. Results are averaged across models. RFT and FDPO stand for rejection sampling fine-tuning and full-chain DPO, respectively. We experiment with several value functions: Self-Exp (Self-Exploration), mc-b (Monte Carlo with binary values), and mix (a combination of LLM-as-a-judge and mc-b). Dashed lines show the performance of fine-tuned TQA models $M_{ft}$ before applying self-improvement methods.
  • Figure 4: Exact Match of models with and without p2-TQA, evaluated across different table sizes and averaged over in-domain datasets. Qw and LM show the performance of Qwen $M_{ft}$ and LlaMA $M_{ft}$, respectively. Instances are grouped into three bins by table token count: $<500$, $500$--$1000$, and $\ge 1000$.
  • Figure 5: Threshold comparison with different value functions on six TQA datasets.
  • ...and 4 more figures
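
The captions of Figures 1 and 2 describe the data-collection idea in words: a state's value is the probability that rolling out from it reaches a correct answer, zero-value states are dropped, and a state is paired with a sibling sampled from the same parent when their values differ by more than a threshold. The following is a minimal Python sketch of that idea, not the authors' implementation. The names `estimate_value`, `make_pair`, and `toy_rollout`, the default of three rollouts, the strict `>` comparison against the threshold, and the toy answers are all assumptions introduced for illustration.

```python
from typing import Callable, Optional, Tuple

# A state is the tuple of cumulative reasoning steps (Figure 1).
State = Tuple[str, ...]


def estimate_value(state: State,
                   rollout: Callable[[State], str],
                   gold: str,
                   n_rollouts: int = 3) -> float:
    """Monte Carlo estimate of V(state): the share of rollouts that end in the gold answer."""
    hits = sum(rollout(state) == gold for _ in range(n_rollouts))
    return hits / n_rollouts


def make_pair(state: State, value: float,
              sibling: State, sibling_value: float,
              threshold: float) -> Optional[Tuple[State, State]]:
    """Return (preferred, dispreferred), or None if the pair is not kept."""
    if value == 0.0:
        # Figure 2: intermediate states with value 0 are not considered.
        return None
    if abs(value - sibling_value) <= threshold:
        # Assumption: the value gap must strictly exceed the threshold.
        return None
    return (state, sibling) if value > sibling_value else (sibling, state)


# Toy rollout (hypothetical): replays fixed final answers instead of calling a model.
_answers = iter(["42", "42", "7"])


def toy_rollout(state: State) -> str:
    return next(_answers)


state = ("select column 'score'", "sum the selected cells")
sibling = ("select column 'score'", "count the selected cells")

value = estimate_value(state, toy_rollout, gold="42")  # two of three rollouts are correct
pair = make_pair(state, value, sibling, sibling_value=0.0, threshold=0.5)
```

In this toy run two of the three rollouts reach the gold answer, so the estimate is $\frac{2}{3}$, the same value as the first example in Figure 2. In practice the rollout would sample completions from the TQA model, and the preferred and dispreferred steps would feed the contrastive learning stage.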