The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

TL;DR

The paper introduces ToRR, a benchmark for table reasoning and robustness that evaluates 14 LLMs across 10 diverse datasets and 6 tabular tasks. It systematically varies input representations through 7 serializations and 4 structural perturbations, creating 35 prompt configurations per example (consistent with 7 serializations × 5 perturbation settings, counting the unperturbed table), and scores each model with clearly defined performance and robustness metrics, $\mathcal{P}_M$ and $\mathcal{R}_M$. Key findings show widespread brittleness: even strong models reach only moderate absolute performance ($\approx 0.5$ or less) and are highly sensitive to formatting, and robustness correlates only imperfectly with model size. No single table format dominates, which underscores the need for multi-configuration testing. The study shows that evaluating with multiple prompts can substantially improve reliability, sometimes matching the benefit of adding more test examples, and it positions ToRR as a guide for robust, real-world evaluation of tabular reasoning with LLMs. The results argue for broader adoption of diverse prompt configurations and more nuanced evaluation protocols that better capture practical capabilities and limitations in table understanding.
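To make the idea of a prompt configuration concrete, here is a minimal Python sketch that crosses a few table serializations with a few structural perturbations. It is an illustration only, not ToRR's implementation: the toy table, the function names, the question text, and the choice of three serializers (markdown, CSV, JSON) and three perturbation settings (none, row shuffle, column shuffle) are all assumptions. ToRR's full grid has 35 configurations; this sketch produces 9.

```python
import csv
import io
import json
import random
from itertools import product

# Toy table (illustrative data, not taken from any ToRR dataset).
HEADER = ["city", "country", "population_m"]
ROWS = [["Paris", "France", "2.1"], ["Rome", "Italy", "2.8"], ["Madrid", "Spain", "3.3"]]


def to_markdown(header, rows):
    """Serialize the table as a markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_csv(header, rows):
    """Serialize the table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().strip()


def to_json(header, rows):
    """Serialize the table as a JSON list of row objects."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def no_perturbation(header, rows, rng):
    return header, rows


def shuffle_rows(header, rows, rng):
    """Reorder rows; the table content is unchanged."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return header, shuffled


def shuffle_columns(header, rows, rng):
    """Reorder columns consistently in the header and every row."""
    order = list(range(len(header)))
    rng.shuffle(order)
    return [header[i] for i in order], [[row[i] for i in order] for row in rows]


# Hypothetical registries; ToRR's actual serializations and perturbations may differ.
SERIALIZERS = {"markdown": to_markdown, "csv": to_csv, "json": to_json}
PERTURBATIONS = {"none": no_perturbation, "row_shuffle": shuffle_rows,
                 "column_shuffle": shuffle_columns}


def build_prompts(header, rows, question, seed=0):
    """Return one prompt per (serialization, perturbation) pair."""
    prompts = {}
    for (s_name, serialize), (p_name, perturb) in product(
            SERIALIZERS.items(), PERTURBATIONS.items()):
        rng = random.Random(seed)  # fixed seed so each configuration is reproducible
        new_header, new_rows = perturb(header, rows, rng)
        prompts[(s_name, p_name)] = f"{question}\n\n{serialize(new_header, new_rows)}"
    return prompts


prompts = build_prompts(HEADER, ROWS, "Which city has the largest population?")
```

Every prompt in `prompts` carries the same instruction and the same information; only the table's surface form differs, which is the kind of variation the benchmark uses to probe robustness.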

Abstract

Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.

Paper Structure

This paper contains 56 sections, 3 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Overview of ToRR. We evaluate LLMs on several tabular reasoning datasets. We apply multiple prompt configurations, consisting of table serializations (methods for representing the table as a string), and an optional perturbation to the table structure. Our results explore model performance and the effects of prompt variability. Our analysis demonstrates that for any number of examples, testing more prompt configurations increases the evaluation reliability.
  • Figure 2: Examples of prompt configurations within ToRR. Each prompt configuration contains the same instructions, but uses a different formatting of the input tables. On the left, the table is configured in markdown format, while on the right, it is serialized using CSV with row shuffling applied to the original table. While all prompts convey the same information, even state-of-the-art models struggle with solving them consistently.
  • Figure 3: For each example we obtained $35$ performance scores using different prompt configurations. The example scores are assigned an index in the range $[1, 35]$, ordered from lowest to highest performance. The plot depicts an average aggregation of each index across all examples. Models exhibit a wide range of scores, reflecting low robustness.
  • Figure 4: The separability score for each dataset in ToRR. This score represents the proportion of model pairs that can be distinguished with confidence, meaning their confidence intervals (CIs; via bootstrapping over $1$K seeds) do not overlap.
  • Figure 5: Model ranking agreement between the datasets in ToRR.
  • ...and 13 more figures
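
The aggregations described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the bootstrap is a plain percentile bootstrap over the example scores, and the 95% interval level is an assumption (the caption states only that CIs are obtained via bootstrapping over 1K seeds).

```python
import random
from itertools import combinations
from statistics import mean


def rank_profile(scores_per_example):
    """Figure 3 style aggregation: sort each example's scores from lowest to
    highest, then average the scores at each rank index across examples."""
    sorted_rows = [sorted(scores) for scores in scores_per_example]
    return [mean(column) for column in zip(*sorted_rows)]


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score (alpha=0.05 is an assumption)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def separability(model_scores, **ci_kwargs):
    """Figure 4 style score: the proportion of model pairs whose confidence
    intervals do not overlap."""
    cis = {model: bootstrap_ci(scores, **ci_kwargs)
           for model, scores in model_scores.items()}
    pairs = list(combinations(cis, 2))
    separated = sum(
        1 for a, b in pairs
        if cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0]
    )
    return separated / len(pairs)
```

In ToRR's setting, `scores_per_example` would hold 35 scores per example (one per prompt configuration), so `rank_profile` returns 35 averaged values; a wide gap between the first and last value indicates low robustness. For `separability`, `model_scores` maps each model name to its per-example scores on one dataset.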