Denoising Table-Text Retrieval for Open-Domain Question Answering

Deokhyung Kang, Baikjin Jung, Yunsu Kim, Gary Geunbae Lee

TL;DR

Denoised Table-Text Retriever (DoTTeR) addresses two problems in table-text open-domain question answering (ODQA): false-positive supervision and limited table-level reasoning. It (i) denoises the training data with a false-positive detector, keeping only the most relevant fused block per question, and (ii) introduces RATE, a rank-aware table encoder that injects min/max column information into the block representations. A Cross-Block Reader is then trained to extract answers from the top retrieved blocks. On OTT-QA, DoTTeR outperforms strong baselines in both retrieval (block/table recall) and downstream QA (EM/F1), and ablations confirm the effectiveness of both denoising and ranking information. The results indicate that table-level signals can significantly improve evidence selection and QA performance in open-domain settings over both tabular and textual data.
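The denoising step can be illustrated with a minimal Python sketch. It assumes that each candidate training instance already carries a question-relevance score from a false-positive detector; the field names (`question_id`, `block_id`, `score`) and the score values are hypothetical and are not taken from the DoTTeR code base.

```python
def denoise(instances):
    """Keep, for each question, only the candidate fused block with the highest score.

    `instances` is a list of dicts with hypothetical keys:
    "question_id", "block_id", and "score" (relevance from a detector model).
    """
    best = {}
    for inst in instances:
        qid = inst["question_id"]
        # Replace the stored candidate only if this one scores strictly higher.
        if qid not in best or inst["score"] > best[qid]["score"]:
            best[qid] = inst
    return list(best.values())


# Made-up scores for illustration only.
candidates = [
    {"question_id": "q1", "block_id": "b1", "score": 0.31},
    {"question_id": "q1", "block_id": "b2", "score": 0.87},
    {"question_id": "q2", "block_id": "b3", "score": 0.55},
]
denoised = denoise(candidates)
kept_blocks = [inst["block_id"] for inst in denoised]
```

In this toy input, question `q1` has two candidate blocks that both contain the answer string, and only the higher-scoring one is kept as the positive example for retriever training.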

Abstract

In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering face two common challenges: first, their retrievers can be affected by false-positive labels in training datasets; second, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach uses a denoised training dataset with fewer false-positive labels, obtained by discarding instances with lower question-relevance scores as measured by a false-positive detection model. We then integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.
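To make the "minimum and maximum values within a column" objective concrete, the sketch below computes the per-cell min/max labels that such a column encoder would be asked to identify. This is only an illustration of the target signal, not the paper's neural encoder; the function name `min_max_flags` and the number-parsing rule (strip thousands separators, treat unparseable cells as non-numeric) are assumptions.

```python
def min_max_flags(column):
    """Return one (is_min, is_max) pair per cell of a table column.

    Cells that cannot be parsed as numbers get (False, False).
    """
    parsed = []
    for cell in column:
        try:
            # Assumed parsing rule: drop thousands separators, then read a float.
            parsed.append(float(cell.replace(",", "")))
        except ValueError:
            parsed.append(None)

    numeric = [v for v in parsed if v is not None]
    if not numeric:
        return [(False, False)] * len(column)

    lo, hi = min(numeric), max(numeric)
    return [(v is not None and v == lo, v is not None and v == hi) for v in parsed]


# Hypothetical column values for illustration.
flags = min_max_flags(["1,200", "850", "n/a", "3,400"])
```

Here the second cell is flagged as the column minimum and the fourth as the maximum. A question such as "which city has the largest population?" depends on exactly this kind of column-level signal, which a block-level encoder does not see on its own.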

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: An example of a question and related table in OTT-QA. Two fused blocks contain the answer "Sydney" to the question, but only the blue-bordered block satisfies the conditions required by the question.
  • Figure 2: An overview of the encoding process for a fused block $b$ with RATE. The fused block $b$ belongs to the table on the left and is encoded into $E_{B}(b)$ from the concatenation of the rank embedding, extracted from the rank-aware column encoder, and the input embedding.
  • Figure 3: Top-1 fused blocks retrieved by OTTeR and DoTTeR, respectively.