Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling

Sebastian Hofstätter; Jiecao Chen; Karthik Raman; Hamed Zamani

TL;DR

The paper tackles label noise in knowledge-intensive generation by introducing relevance-based confidence sampling: training pairs are filtered with a threshold $t$ on the relevance confidence $\Phi(p, q, a)$, using a BLEU-based signal from KILT. It trains a single Fusion-in-Decoder (FiD) model on seven KILT tasks, experiments with alternative retrievable units (200-word passages), and scales generator capacity with T5X backbones, achieving state-of-the-art results on five of the seven tasks. Key findings include large gains on TriviaQA ($+12.7$ EM) and T-REx ($+4.9$ Accuracy), robust improvements with larger models, and an analysis indicating that the gains are not due to benchmark gaming. The approach shows that data-driven cleaning and multi-task training can substantially boost RAG-style systems without retraining the retriever, which informs practical design choices for knowledge-intensive NLP.

Abstract

This paper studies multi-task training of retrieval-augmented generation models for knowledge-intensive tasks. We propose to clean the training set by utilizing a distinct property of knowledge-intensive generation: the connection of query-answer pairs to items in the knowledge base. We filter training examples via a confidence threshold on the relevance labels, that is, on whether a pair is answerable by the knowledge base or not. We train a single Fusion-in-Decoder (FiD) generator on seven combined tasks of the KILT benchmark. The experimental results suggest that our simple yet effective approach substantially improves over competitive baselines on two strongly imbalanced tasks, and shows either smaller improvements or no significant regression on the remaining tasks. Furthermore, we demonstrate that our multi-task training with relevance label sampling scales well with increased model capacity and achieves state-of-the-art results in five out of seven KILT tasks.
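The filtering step described above can be illustrated with a minimal sketch. It assumes that each training example carries a query $q$, an answer $a$, and a set of candidate passages $p$, and that an example is kept when its best passage reaches the threshold $t$. The names `Example`, `toy_phi`, and `filter_by_confidence`, the max-over-passages aggregation, and the `>=` comparison are all illustrative assumptions; `toy_phi` is a simple token-overlap stand-in, not the BLEU-based signal used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    """Hypothetical training example: query, answer, and candidate passages."""
    query: str
    answer: str
    passages: Tuple[str, ...]


def toy_phi(passage: str, query: str, answer: str) -> float:
    """Stand-in for the relevance confidence Phi(p, q, a).

    Returns the fraction of answer tokens that occur in the passage.
    This is an assumption for illustration only; the query is unused here.
    """
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    passage_tokens = set(passage.lower().split())
    hits = sum(1 for tok in answer_tokens if tok in passage_tokens)
    return hits / len(answer_tokens)


def filter_by_confidence(
    examples: List[Example],
    phi: Callable[[str, str, str], float],
    t: float,
) -> List[Example]:
    """Keep examples whose best passage has confidence of at least t."""
    kept = []
    for ex in examples:
        # Assumed aggregation: the maximum confidence over all passages.
        best = max((phi(p, ex.query, ex.answer) for p in ex.passages), default=0.0)
        if best >= t:
            kept.append(ex)
    return kept


examples = [
    Example("capital of France?", "Paris", ("Paris is the capital of France",)),
    Example("who wrote the report?", "Jane Doe", ("An unrelated passage",)),
    Example("no provenance available?", "unknown", ()),
]
cleaned = filter_by_confidence(examples, toy_phi, t=0.5)
```

In this toy run only the first example survives, since its answer is fully covered by its passage; the other two have no supporting passage and are dropped. Raising or lowering `t` trades training set size against label cleanliness.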

Paper Structure (17 sections, 2 equations, 2 figures, 2 tables)

Introduction
Relevance-Based Confidence Sampling
Experiment Design
KILT multi-task training.
Alternative retrievable units.
Implementation.
Evaluation.
Results
Sampling strategies.
Retrievable units.
Scaling the generator capacity.
Leaderboard comparison.
Are we just gaming the benchmark?
Related Work
Multi-task training.
...and 2 more sections

Figures (2)

Figure 1: Training examples per task and sampling method. Hatched bars indicate downsampling with potentially more training data available.
Figure 2: Statistics of the passage lengths of the raw KILT texts, its original chunking (Orig-100) and our alternative approach (Alt-200). The word counts are binned to 10 words.
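The passage-length statistics in Figure 2 can be approximated with a short sketch. It assumes a naive fixed-window chunker that splits on whitespace and caps each passage at a given number of words; the actual Orig-100 and Alt-200 chunking procedures are not specified here and may differ. The function names `chunk_words` and `binned_length_histogram` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive passages of at most max_words words.

    Naive whitespace tokenization; an assumption for illustration only.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def binned_length_histogram(passages: List[str], bin_width: int = 10) -> Dict[int, int]:
    """Count passages per word-length bin, keyed by the lower edge of each bin."""
    counts = Counter(
        (len(p.split()) // bin_width) * bin_width for p in passages
    )
    return dict(sorted(counts.items()))


# A synthetic 450-word document yields passages of 200, 200, and 50 words.
document = " ".join(["word"] * 450)
passages = chunk_words(document, max_words=200)
histogram = binned_length_histogram(passages, bin_width=10)
```

With a 10-word bin width, each passage length is mapped to the lower edge of its bin, which matches the binning described in the caption.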