TANQ: An open domain dataset of table answered questions

Mubashara Akhtar, Chenxi Pang, Andreea Marzoca, Yasemin Altun, Julian Martin Eisenschlos

TL;DR

TANQ addresses the challenge of answering open-domain questions by constructing table-form answers that synthesize information from multiple sources. The authors present a five-step automated pipeline leveraging QAMPARI as a seed, Wikidata for relation extension, and Wikipedia for multi-modal evidence, culminating in per-cell source attribution within the answer tables. They evaluate state-of-the-art models in closed-book, oracle, and open-book settings, finding that even strong baselines lag behind human performance, with an overall F1 of 60.7 in the oracle setting and notable weaknesses in numeracy, table formatting, and complex reasoning. The work highlights the complexity of multi-source table QA, offers a detailed analysis of failure modes, and lays the groundwork for future improvements in dataset construction, evaluation metrics, and model capabilities for producing richly structured, evidence-backed tabular answers.

Abstract

Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, Gemini Flash, reaches an overall F1 score of 60.7, lagging behind human performance by 12.3 points. We analyse the baselines' performance across dataset attributes such as the skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
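
To make the idea of a table answer with per-cell source attribution concrete, the following is a minimal sketch rather than the dataset's actual schema or the paper's evaluation metric. The `Cell` class, its field names, the example films, and the set-overlap `cell_f1` function are all hypothetical and serve only to illustrate how a predicted table might be compared cell by cell against a gold table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell plus the source it came from (hypothetical schema)."""
    row_key: str      # entity the row describes
    column: str       # relation / column header
    value: str        # cell content
    source: str = ""  # attribution, e.g. a Wikipedia page title


def triples(cells):
    """Reduce cells to case-insensitive (row, column, value) triples."""
    return {(c.row_key.lower(), c.column.lower(), c.value.lower()) for c in cells}


def cell_f1(predicted, gold):
    """Set-overlap F1 over cells. Illustrative only, not the paper's metric."""
    pred, ref = triples(predicted), triples(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up gold table: every cell records where its value was found.
gold = [
    Cell("Film A", "director", "Jane Doe", source="Film A (Wikipedia)"),
    Cell("Film A", "release year", "1999", source="Film A (Wikipedia)"),
    Cell("Film B", "director", "Jane Doe", source="Jane Doe (Wikipedia)"),
    Cell("Film B", "release year", "2004", source="Film B (Wikipedia)"),
]

# Made-up model output with one wrong cell.
predicted = [
    Cell("Film A", "director", "Jane Doe"),
    Cell("Film A", "release year", "1999"),
    Cell("Film B", "director", "Jane Doe"),
    Cell("Film B", "release year", "2005"),  # wrong year
]

score = cell_f1(predicted, gold)
print(f"cell-level F1: {score:.2f}")
```

In this sketch the attribution field is ignored during scoring, so it measures only whether the cell contents match. Exact string matching is also a simplification: it would penalise answers that differ only in formatting or units, which is one reason a benchmark like this needs a more tolerant metric.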

Paper Structure (48 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: An example question in TANQ and its corresponding table answer. Supporting evidence from multiple pages in a Wikipedia snapshot is provided for each data point inside the table. We highlight the rationale inside each snippet in yellow. LLMs are evaluated with or without access to the evidence.
  • Figure 2: TANQ creation pipeline consisting of five steps: 1. Extending QAMPARI questions with additional relations based on Wikidata; 2. Evidence extraction (text, tables, infoboxes) from Wikipedia articles; 3. Evidence evaluation and (gold) answer table generation; 4. Rephrasing of the question from the first step; 5. Augmentation with additional skills to generate complex questions. We include a running example at the bottom.
  • Figure 3: Prompts for baseline evaluation on TANQ: closed book, oracle, and open book with augmented tools.
  • Figure 4: Prompt used for evidence evaluation. We prompt a language model to evaluate the extracted evidence in a natural language inference setting. The LLM labels the input statements as "verifiable" or "not verifiable" based on evidence provided in the form of a sentence, table, or infobox.
  • Figure 5: Rephrasing questions increases the naturalness of template-based extension questions. To ensure the question meaning is preserved during rephrasing, we add structured annotations in parentheses to questions with the name of each relation.
  • ...and 2 more figures