How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Akash Ghosh; B Venkata Sahith; Niloy Ganguly; Pawan Goyal; Mayank Singh

TL;DR

This work introduces SciTabQA, a benchmark for question answering over hybrid scientific tables and their descriptive text, and evaluates three state-of-the-art Tabular QA systems (TAPAS, TAPEX, OmniTab) on 822 QA pairs drawn from 198 scientific tables. Formally, the task takes a question $q$ and a context $T'=(T,c,d)$ and seeks the answer $a$ that maximizes $p(a \mid q,T')$. Experiments reveal that even strong models struggle on this dataset: OmniTab achieves the best F1 score, 0.462, and adding caption and description context often hurts performance. The authors describe the dataset creation process and annotation guidelines, and analyze factors such as caption usage, truncation, and transfer learning, highlighting the need for better models capable of multi-modal scientific reasoning. The work contributes a publicly available benchmark, a multi-category annotation scheme, and empirical insights into the challenges of robustly interpreting scientific tables alongside text, guiding future research in domain-specific QA for scientific documentation.
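The formulation above can be illustrated with a minimal Python sketch. It assumes that $T$, $c$, and $d$ stand for the table, its caption, and its description (consistent with the figure captions, but not spelled out in this summary). The names `HybridContext`, `predict_answer`, and `toy_score` are hypothetical, and the toy scorer is only a stand-in for a real model's $p(a \mid q,T')$; it is not how TAPAS, TAPEX, or OmniTab score answers.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HybridContext:
    """Hypothetical container for T' = (T, c, d)."""
    table: list[list[str]]  # T: first row is the header (assumption)
    caption: str            # c: table caption (assumed meaning)
    description: str        # d: table description (assumed meaning)


def predict_answer(
    question: str,
    context: HybridContext,
    candidates: Sequence[str],
    score: Callable[[str, str, HybridContext], float],
) -> str:
    """Return the candidate a that maximizes score(a, q, T')."""
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    return max(candidates, key=lambda a: score(a, question, context))


def toy_score(answer: str, question: str, context: HybridContext) -> float:
    """Toy stand-in for p(a | q, T'): counts question words that appear
    as cells in a table row containing the candidate answer."""
    q_tokens = set(question.lower().split())
    best = 0
    for row in context.table[1:]:  # skip the header row
        if answer in row:
            overlap = sum(1 for cell in row if cell.lower() in q_tokens)
            best = max(best, overlap)
    return float(best)


# Illustrative data only; these values are not taken from the paper.
ctx = HybridContext(
    table=[["Model", "Score"], ["X", "10"], ["Y", "20"]],
    caption="Toy results table.",
    description="Model Y is the stronger system.",
)
answer = predict_answer("what is the score of model Y", ctx, ["10", "20"], toy_score)
```

In practice the scoring function would be a trained model that reads the question together with the table, caption, and description; the sketch only shows where the argmax in the formulation sits.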

Abstract

Question-answering (QA) on hybrid scientific tabular and textual data deals with scientific information and relies on complex numerical reasoning. Although tabular QA has seen rapid progress in recent years, the robustness of tabular QA models on scientific information remains poorly understood because no benchmark dataset exists. To investigate the robustness of existing state-of-the-art QA models on scientific hybrid tabular data, we propose a new dataset, "SciTabQA", consisting of 822 question-answer pairs from scientific tables and their descriptions. With the help of this dataset, we assess state-of-the-art Tabular QA models on their ability (i) to use heterogeneous information requiring both structured data (table) and unstructured data (text) and (ii) to perform complex scientific reasoning tasks. In essence, we check the capability of the models to interpret scientific tables and text. Our experiments show that "SciTabQA" is an innovative dataset to study question-answering over scientific heterogeneous data. We benchmark three state-of-the-art Tabular QA models and find that the best F1 score is only 0.462.

Paper Structure (15 sections, 1 equation, 2 figures, 6 tables)

Introduction
Problem formulation
Dataset
Data collection and preprocessing
Annotation Guidelines
Question tagging
Inter-annotator agreement
Baselines
Experiments
Results and Analysis
Adding caption and description
Truncation statistics
Transfer learning TableQA tasks
Conclusion
Limitations

Figures (2)

Figure 1: Scientific Hybrid Table Question Answering: for various questions, additional information from table captions, as well as table descriptions, may be required to come up with the appropriate answers. For instance, in the example, 'instances without superficial cues' is understood only from the description.
Figure 2: Data pre-processing and collection from SciGen to SciTabQA dataset.
