Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling
Sebastian Hofstätter, Jiecao Chen, Karthik Raman, Hamed Zamani
TL;DR
The paper tackles label noise in knowledge-intensive generation by introducing relevance-based confidence sampling: filtering training pairs with a threshold $t$ on the relevance confidence $\Phi(p, q, a)$ and using a BLEU-based signal from KILT. It trains a single Fusion-in-Decoder (FiD) model on seven KILT tasks, experiments with alternative retrievable units (200-word passages), and scales generator capacity with T5X backbones, achieving state-of-the-art results on five of the seven tasks. Key findings include large gains on TriviaQA ($+12.7$ EM) and T-REx ($+4.9$ Accuracy), consistent improvements with larger models, and an analysis indicating that the gains are not due to benchmark gaming. The approach shows that data-driven cleaning and multi-task training can substantially boost RAG-style systems without retraining the retriever, which informs practical design choices for knowledge-intensive NLP.
Abstract
This paper studies multi-task training of retrieval-augmented generation models for knowledge-intensive tasks. We propose to clean the training set by utilizing a distinct property of knowledge-intensive generation: the connection of query-answer pairs to items in the knowledge base. We filter training examples via a confidence threshold on the relevance labels, which indicate whether a pair is answerable by the knowledge base or not. We train a single Fusion-in-Decoder (FiD) generator on seven combined tasks of the KILT benchmark. The experimental results suggest that our simple yet effective approach substantially improves over competitive baselines on two strongly imbalanced tasks, and shows either smaller improvements or no significant regression on the remaining tasks. Furthermore, we demonstrate that our multi-task training with relevance label sampling scales well with increased model capacity and achieves state-of-the-art results in five out of seven KILT tasks.
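
The thresholded filtering described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Example` container, the `filter_by_relevance` helper, and the `toy_confidence` scorer are hypothetical names introduced here, and taking the maximum confidence over an example's retrieved passages is an assumption, since the text only states that pairs are filtered with a threshold $t$ on $\Phi(p, q, a)$.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One training pair plus its retrieved passages (hypothetical container)."""
    query: str
    answer: str
    passages: List[str]


def filter_by_relevance(
    examples: List[Example],
    confidence: Callable[[str, str, str], float],
    t: float,
) -> List[Example]:
    """Keep examples whose best passage reaches relevance confidence >= t.

    `confidence(p, q, a)` plays the role of Phi(p, q, a). Aggregating with
    max over passages is an assumption made for this sketch.
    """
    kept = []
    for ex in examples:
        best = max(
            (confidence(p, ex.query, ex.answer) for p in ex.passages),
            default=0.0,  # no retrieved passages: treat as not answerable
        )
        if best >= t:
            kept.append(ex)
    return kept


def toy_confidence(passage: str, query: str, answer: str) -> float:
    """Stand-in scorer (assumption): 1.0 if the answer string occurs in the passage."""
    return 1.0 if answer.lower() in passage.lower() else 0.0


examples = [
    Example("What is the capital of France?", "Paris",
            ["Paris is the capital of France."]),
    Example("Who wrote the report?", "Jane Doe",
            ["This passage is about something else."]),
]
cleaned = filter_by_relevance(examples, toy_confidence, t=0.5)
print([ex.answer for ex in cleaned])
```

In this toy setup only the first pair survives, because the second has no passage supporting its answer. A real scorer would replace `toy_confidence` with a learned or provenance-based relevance signal.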
