SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation

Hexuan Wang, Yaxuan Ren, Srikar Bommireddypalli, Shuxian Chen, Adarsh Prabhudesai, Rongkun Zhou, Elina Baral, Philipp Koehn

TL;DR

This work introduces SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation, and reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies.

Abstract

We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.

Paper Structure (43 sections, 11 figures, 7 tables)

Figures (11)

  • Figure 1: An example from our dataset that requires complex computations: After identifying the relevant values (the rows for Qwen2-Audio), compute the average accuracy across settings for each language and identify the lowest-performing (hardest) language. Repeat this for all other models.
  • Figure 2: Example of a manually created pseudocode plan. The plan corresponds to the example question in Figure 1.
  • Figure 3: Model Agreement Matrix. Rows represent questions sorted by difficulty, defined as the total number of models that answered correctly (easiest at top). Columns represent models sorted by accuracy. Blue indicates a correct prediction. A solid gray band at the bottom highlights questions unsolved by any system.
  • Figure 4: Impact of Table Scale on Performance. We stratify model accuracy by the Structural Size ($S_{cell}$) of the table. While frontier models (faint lines at top) remain relatively stable, smaller models and code-based systems (faint lines at bottom) degrade significantly as table size increases. Sample sizes ($N$) for each bin are denoted on the x-axis.
  • Figure 5: Impact of Reasoning Complexity. We measure accuracy against four algorithmic hardness metrics: Calculation Intensity ($I_{calc}$), Retrieval Demand ($I_{retr}$), Plan Horizon ($L_{plan}$), and Control Flow ($C_{flow}$). Across all dimensions, increased complexity correlates with lower performance. Sample sizes ($N$) are provided on the x-axis; trends at the tails should be interpreted in the context of smaller bin sizes.
  • ...and 6 more figures
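
The computation described in the Figure 1 caption (average each language's accuracy across settings, then pick the lowest-scoring language, repeated per model) can be sketched roughly as follows. This is only an illustrative sketch: the `scores` dictionary, the second model name, the language codes, and all numbers are hypothetical placeholders and are not taken from the paper's tables.

```python
from statistics import mean

# Hypothetical accuracies (%): model -> language -> one score per setting.
# These values are invented for illustration; they are not from the paper.
scores = {
    "Qwen2-Audio": {
        "en": [80.0, 70.0],
        "de": [60.0, 50.0],
        "zh": [65.0, 75.0],
    },
    "OtherModel": {  # placeholder name for "all other models"
        "en": [90.0, 85.0],
        "de": [70.0, 80.0],
        "zh": [60.0, 65.0],
    },
}


def hardest_language(per_language):
    """Return (language, mean accuracy) for the lowest mean across settings."""
    averages = {lang: mean(vals) for lang, vals in per_language.items()}
    lang = min(averages, key=averages.get)
    return lang, averages[lang]


# Repeat the same average-then-argmin step for every model.
hardest = {model: hardest_language(langs) for model, langs in scores.items()}

for model, (lang, avg) in hardest.items():
    print(f"{model}: hardest language = {lang} (mean accuracy {avg:.1f})")
```

Even this small case involves retrieval (selecting the right rows), arithmetic (per-language means), and a loop over models, which loosely corresponds to the retrieval, calculation, and control-flow dimensions named in the Figure 5 caption.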