FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering

Wei Zhou; Mohsen Mesgar; Heike Adel; Annemarie Friedrich

TL;DR

FREB-TQA introduces a fine-grained robustness benchmark for Table Question Answering, formalizing three essential desiderata: retrieval robustness under table-structure perturbations, faithful attention to relevant cells, and robust numerical reasoning during aggregation. Built from four datasets, it employs seven automatic perturbations plus manual annotations to generate 75,205 instances across 8,590 questions, enabling diagnostics across end-to-end, pipeline, and LLM-based TQA systems. Across extensive experiments, no state-of-the-art model consistently excels in all aspects; end-to-end systems resist table-structure changes but struggle with numeric reasoning, LLMs handle numeric operations but show sensitivity to input length and structure, and pipeline models maintain faithfulness via symbolic execution. The benchmark offers a practical tool for driving robust TQA research, suggesting that hybrid pipeline approaches and future enhancements for long-table handling and multilingual data are promising avenues for real-world applications.

Abstract

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, the underlying cause and nature of this issue remain largely unclear, which poses a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of the robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.
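To make desideratum (i) concrete, the following is a minimal sketch of three table-structure perturbations (shuffling rows, moving a target row, and transposing). It is an illustration under stated assumptions, not the authors' implementation: the table representation (a header list plus a list of rows) and the function names `shuffle_rows`, `move_row`, and `transpose` are hypothetical.

```python
import random
from typing import List, Tuple

# Assumed representation: (header, rows), every row as long as the header.
Table = Tuple[List[str], List[List[str]]]


def shuffle_rows(table: Table, seed: int = 0) -> Table:
    """Return a copy of the table with its rows in a seeded random order."""
    header, rows = table
    shuffled = rows[:]                    # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)
    return header, shuffled


def move_row(table: Table, target: int, position: int) -> Table:
    """Move the row at index `target` so that it ends up at index `position`."""
    header, rows = table
    rest = rows[:target] + rows[target + 1:]
    return header, rest[:position] + [rows[target]] + rest[position:]


def transpose(table: Table) -> Table:
    """Swap rows and columns; the first column becomes the new header."""
    header, rows = table
    flipped = [list(col) for col in zip(*([header] + rows))]
    return flipped[0], flipped[1:]
```

A robust system should return the same answer for a question on the original table and on any of these variants, since none of them changes the cell contents.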

Paper Structure (36 sections, 1 equation, 9 figures, 12 tables)

Introduction
Related Work
TQA systems.
Robustness evaluation.
Our Benchmark: FREB-TQA
Source Datasets
Extraction and Reasoning Questions
Question type classification for WTQ.
Perturbations for Testing Retrieval Robustness against Table Structure Changes
Shuffle all rows (columns).
Shift target rows (columns).
Transpose.
Perturbations for Testing Attention to Relevant Cells
Remove relevant cells.
Remove table.
...and 21 more sections

Figures (9)

Figure 1: Our benchmark addresses three aspects of robustness shown in the yellow boxes. Answers are bold in tables. Original tables (left) are what exist in a TQA dataset and changed tables (right) show tables after perturbations. We demonstrate three perturbations in this figure (top to bottom): table transposing, removing relevant cells and modifying values to change answers.
Figure 2: An example of aggregation/comparison robustness in case of string change. ① illustrates shortening an original table to table cells on which numerical aggregations or comparisons operate. ② illustrates modifications on a shortened table, leading either to a change in answer or not. The orange parts mark changed strings.
Figure 3: Exact match difference (Emd) under table-structure perturbations testing retrieval robustness, for extraction questions, averaged across four datasets and seeds. R and C stand for row and column. SA, TT, TM, TB and TF stand for shuffle all, target top, target middle, target bottom/back and target front, respectively.
Figure 4: Variation percentage (VP) under table-structure perturbations testing retrieval robustness, for extraction questions, averaged across four datasets and seeds. R and C stand for row and column. SA, TT, TM, TB and TF stand for shuffle all, target top, target middle, target bottom/back and target front, respectively.
Figure 5: LLaMA2 Prompt for classifying extraction and reasoning questions.
...and 4 more figures
