The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer
TL;DR
The paper introduces ToRR, a benchmark for table reasoning and robustness that evaluates 14 LLMs across 10 diverse datasets and 6 tabular tasks. It systematically varies the input representation through 7 serializations and 4 structural perturbations, creating 35 prompt configurations per example, and assesses performance and robustness via the metrics $\mathcal{P}_M$ and $\mathcal{R}_M$. The key finding is widespread brittleness: even strong models reach only moderate absolute performance ($\approx$0.5 or less) and are highly sensitive to formatting, and robustness correlates only imperfectly with model size. No single table format dominates, which underscores the need for multi-configuration testing. The study shows that evaluating over multiple prompts can substantially improve reliability, sometimes matching the benefit of adding more test examples. The results argue for broader adoption of diverse prompt configurations and more nuanced evaluation protocols, so that benchmarks better capture the practical capabilities and limitations of LLMs in table understanding.
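The count of 35 configurations can be reproduced with a short sketch. It assumes the unperturbed table counts as a fifth structural variant alongside the 4 perturbations, so that 7 serializations times 5 variants gives 35. That reading is an inference from the numbers above, not something stated here, and the labels below are hypothetical placeholders because the summary gives only the counts.

```python
from itertools import product

# Placeholder labels: the summary states only the counts (7 and 4), not the names.
SERIALIZATIONS = [f"serialization_{i}" for i in range(1, 8)]

# Assumption: the original (unperturbed) table is a fifth structural variant.
STRUCTURAL_VARIANTS = ["unperturbed"] + [f"perturbation_{i}" for i in range(1, 5)]


def prompt_configurations():
    """Return every (serialization, structural variant) pair."""
    return list(product(SERIALIZATIONS, STRUCTURAL_VARIANTS))


configs = prompt_configurations()
print(len(configs))  # 7 * 5 = 35
```

Under this reading, each test example is rendered once per pair, so a model's score on one example becomes a set of 35 scores rather than a single number.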
Abstract
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
