RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations

Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, Dragomir Radev

TL;DR

RobuT introduces the first diagnostic benchmark for Table QA robustness, exposing the fragility of state-of-the-art models to adversarial perturbations of table headers, table content, and natural language questions (NLQs). The authors show that large language models are comparatively more robust, although they still falter on the adversarial sets. This motivates LeTA, a framework that prompts LLMs to generate adversarial training data and substantially improves robustness. By combining human-annotated perturbations with LLM-generated augmentation, RobuT and LeTA offer a scalable path to more trustworthy Table QA systems. The work highlights practical implications for deploying Table QA in real-world settings and points to directions for future robustness research.

Abstract

Despite significant progress in question answering on tabular data (Table QA), it is unclear whether, and to what extent, existing Table QA models are robust to task-specific perturbations, e.g., replacing key question entities or shuffling table columns. To systematically study the robustness of Table QA models, we propose a benchmark called RobuT, which builds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) and includes human-annotated adversarial perturbations in terms of table header, table content, and question. Our results indicate that both state-of-the-art Table QA models and large language models (e.g., GPT-3) with few-shot learning falter on these adversarial sets. We propose to address this problem by using large language models to generate adversarial examples to enhance training, which significantly improves the robustness of Table QA models. Our data and code are publicly available at https://github.com/yilunzhao/RobuT.
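
The column-shuffling perturbation mentioned in the abstract can be illustrated with a minimal sketch. This is not the RobuT annotation pipeline, which relies on human annotators. The function names `shuffle_columns` and `is_robust`, the toy table, and the two toy "models" are all hypothetical and exist only to show what it means for an answer to survive, or not survive, a perturbation that leaves the table's meaning unchanged.

```python
import random
from typing import Callable, List, Tuple

Header = List[str]
Rows = List[List[str]]


def shuffle_columns(header: Header, rows: Rows, seed: int = 0) -> Tuple[Header, Rows]:
    """Return a copy of the table with its columns in a random order.

    Cell values stay attached to their column names; only positions change.
    A shuffle can occasionally leave the order unchanged.
    """
    order = list(range(len(header)))
    random.Random(seed).shuffle(order)
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


def is_robust(model: Callable[[str, Header, Rows], str],
              question: str, header: Header, rows: Rows, seed: int = 0) -> bool:
    """True if the model's answer is unchanged after the column shuffle."""
    original = model(question, header, rows)
    perturbed = model(question, *shuffle_columns(header, rows, seed))
    return original == perturbed


# Hypothetical toy table and question, for illustration only.
HEADER = ["Nation", "Gold", "Silver"]
ROWS = [["Japan", "9", "4"], ["Kenya", "3", "7"]]
QUESTION = "How many gold medals did Japan win?"


def name_based_model(question: str, header: Header, rows: Rows) -> str:
    """Toy 'model' that finds the answer through column names."""
    nation, gold = header.index("Nation"), header.index("Gold")
    return next(row[gold] for row in rows if row[nation] == "Japan")


def position_based_model(question: str, header: Header, rows: Rows) -> str:
    """Toy 'model' that assumes the answer sits in the second column."""
    return rows[0][1]
```

Under these assumptions, `is_robust(name_based_model, QUESTION, HEADER, ROWS, seed=s)` holds for any seed, whereas `position_based_model` changes its answer whenever the shuffle moves the Gold column out of the second position. A real evaluation would replace the toy functions with an actual Table QA model and compare accuracy on the original and perturbed sets.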

Paper Structure (42 sections, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Examples of adversarial perturbation over table header (blue), table content (orange), and question (purple). The Table QA model predicts a correct answer on the original example but fails on the perturbed ones.
  • Figure 2: Overview of the adversarial annotation process used to collect perturbed NLQs at the word level and sentence level with a model in the loop. $\bm{A_{orig}}$ is the answer predicted by the Table QA model (i.e., TaBERT-small), given the table $\bm{T}$ and pre-perturbed question $\bm{Q_{orig}}$.
  • Figure 3: An example of a GPT-3 "chain-of-thought" prompt prefix for the Table QA tasks.
  • Figure 4: An example of prompt prefix for header synonym replacement using GPT-3. The GPT-3 model is prompted to perturb the table header, given the table context (i.e., table header, and first two rows of the table).
  • Figure 5: An example of prompt prefix for column adding perturbation using CodeX. The candidate table is retrieved by the TaPas-based dense retriever. The CodeX model is prompted to select one column from the candidate table that can be inserted into the source table.
  • ...and 1 more figure