CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks

Xiaoxi Li; Zhicheng Dou; Yujia Zhou; Fangchao Liu

TL;DR

CorpusLM addresses hallucination in knowledge-intensive tasks by unifying generative retrieval, closed-book generation, and retrieval-augmented generation within a single greedy decoding framework. It introduces a ranking-oriented DocID list generation strategy for end-to-end RAG, a continuous DocIDs-References-Answer decoding strategy to streamline retrieval and answer synthesis, and unsupervised DocID understanding tasks to align DocID semantics with downstream goals. The model is trained via multi-task learning with a combined loss and evaluated on the KILT benchmark using T5 and Llama2 backbones, where it outperforms strong baselines in both retrieval and downstream tasks. The approach reduces memory usage and latency and shows that retrieval can be integrated into a single generative language model, with implications for scalable, knowledge-grounded AI systems.

Abstract

Large language models (LLMs) have gained significant attention in various fields but are prone to hallucination, especially in knowledge-intensive (KI) tasks. To address this, retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy. However, traditional retrieval modules often rely on a large document index and are disconnected from generative tasks. With the advent of generative retrieval (GR), language models can retrieve by directly generating document identifiers (DocIDs), offering superior performance in retrieval tasks. However, the potential relationship between GR and downstream tasks remains unexplored. In this paper, we propose CorpusLM, a unified language model that leverages an external corpus to tackle various knowledge-intensive tasks by integrating generative retrieval, closed-book generation, and RAG through a unified greedy decoding process. We design the following mechanisms to facilitate effective retrieval and generation and to improve the end-to-end effectiveness of KI tasks: (1) We develop a ranking-oriented DocID list generation strategy, which refines GR by directly learning from a DocID ranking list, to improve retrieval quality. (2) We design a continuous DocIDs-References-Answer generation strategy, which facilitates effective and efficient RAG. (3) We employ unsupervised DocID understanding tasks so that the model comprehends DocID semantics and their relevance to downstream tasks. We evaluate our approach on the widely used KILT benchmark with two variants of backbone models, i.e., T5 and Llama2. Experimental results demonstrate the superior performance of our models in both retrieval and downstream tasks.

Paper Structure (28 sections, 10 equations, 3 figures, 6 tables)

Introduction
Related Work
Methodology
Task Formulation
CorpusLM: the Unified Language Model
Generative Retrieval: DocID List Generation
Training Data Construction
Training Objective
Inference Constraints
Closed-book Answer Generation
RAG: Continuous Generation Strategy
Unsupervised DocID Understanding Tasks
Experimental Settings
Datasets
Evaluation Metrics
...and 13 more sections

Figures (3)

Figure 1: Overview of the CorpusLM framework. We aim to develop a unified language model that utilizes an external corpus to handle various knowledge-intensive tasks by integrating generative retrieval, closed-book generation, and RAG. To accomplish all tasks effectively using unified greedy decoding, we propose a ranking-oriented DocID list generation strategy to improve generative retrieval performance, and a continuous RAG strategy to sequentially decode the DocID ranking list, references, and answer. We also enhance the model's comprehension of DocID semantics through unsupervised DocID understanding tasks.
Figure 2: Illustration of the two generation strategies for retrieval and RAG. (a) Ranking-oriented DocID list generation strategy. We utilize a prefix tree built from the DocIDs in the corpus to dynamically add constraints so that the generated DocID ranking list is valid and contains no repeated DocIDs. (b) Continuous generation strategy. It comprises three continuous decoding steps: (1) decode DocIDs and map them to the corresponding documents; (2) decode fine-grained references from the documents; (3) decode the final answer.
Figure 3: Analysis of the ranking capability of CorpusLM for retrieval tasks. We compare our T5-based CorpusLM with the dense retriever SimLM and another T5-based generative retriever with traditional beam search, focusing on Recall@{1, 5, 10}.
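
The prefix-tree constraint described in the Figure 2(a) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the token-level DocIDs, the `EOS` marker, and the helper names `build_trie`, `has_unused`, and `allowed_tokens` are all assumptions made for illustration. The idea shown is that, at each decoding step, only tokens that extend the current prefix toward a DocID that exists in the corpus and has not already been emitted are allowed.

```python
from typing import Dict, Iterable, List, Set, Tuple

EOS = "<eos>"  # hypothetical marker that closes one DocID
Trie = Dict[str, "Trie"]
DocID = Tuple[str, ...]


def build_trie(docids: Iterable[DocID]) -> Trie:
    """Build a prefix tree over tokenized DocIDs; EOS marks a complete DocID."""
    root: Trie = {}
    for docid in docids:
        node = root
        for tok in docid:
            node = node.setdefault(tok, {})
        node[EOS] = {}
    return root


def has_unused(node: Trie, prefix: DocID, used: Set[DocID]) -> bool:
    """True if some complete DocID at or below `node` has not been generated yet."""
    for tok, child in node.items():
        if tok == EOS:
            if prefix not in used:
                return True
        elif has_unused(child, prefix + (tok,), used):
            return True
    return False


def allowed_tokens(trie: Trie, prefix: DocID, used: Set[DocID]) -> List[str]:
    """Tokens that keep the partial DocID valid and avoid repeating a used DocID."""
    node = trie
    for tok in prefix:
        node = node[tok]  # raises KeyError if the prefix itself is invalid
    allowed = []
    for tok, child in node.items():
        if tok == EOS:
            if prefix not in used:
                allowed.append(EOS)
        elif has_unused(child, prefix + (tok,), used):
            allowed.append(tok)
    return sorted(allowed)


# Toy corpus of three DocIDs, each a tuple of (assumed) tokens.
trie = build_trie([("apple", "pie"), ("apple", "tree"), ("banana",)])
used: Set[DocID] = set()
print(allowed_tokens(trie, (), used))          # ['apple', 'banana']
used.add(("apple", "pie"))                     # first DocID in the list is done
print(allowed_tokens(trie, ("apple",), used))  # ['tree']
```

In a real decoder the returned token set would be used to mask the model's next-token scores before the greedy choice; how CorpusLM tokenizes DocIDs and separates list entries is not specified in this excerpt, so those details are left out here.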
