ACER: Automatic Language Model Context Extension via Retrieval

Luyu Gao, Yunyi Zhang, Jamie Callan

TL;DR

ACER addresses the challenge of extending short-context language models to long-context tasks without relying on labor-intensive labeled long-context data. It introduces a two-stage pipeline: automatic data synthesis using a ranker to select relevant chunks and a short-context generator to produce an answer with chain-of-thought, followed by fine-tuning a long-context LM on the synthetic CoT data. Empirical results show ACER outperforming contemporary open-weight long-context models on long-context RAG tasks and narrative reading comprehension, and achieving robustness to different retrievers, including BM25. The work demonstrates a cost-effective, self-supervised path to enhance long-context understanding, with implications for task-specific long-context applications where data is scarce.

Abstract

Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy retrieval stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an automatic data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieval-and-read pipeline used to synthesize the training data in real-world tasks such as long-context retrieval-augmented generation.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The full process of ACER involves a data synthesis stage and a fine-tuning stage. (top) The data synthesis stage splits the context, retrieves a set of relevant text chunks for a problem, and uses a short-context model to generate an answer with chain-of-thought (CoT). (bottom) The fine-tuning stage uses the original long-context data and the synthetic CoT answer to fine-tune a long-context model.
  • Figure 2: Prompts given to the LM to produce relevance judgements.
  • Figure 3: Prompts given to the LM to produce the final CoT answer.
  • Figure 4: Performance comparison between Llama3.1 and ACER when reading different context sizes on the Natural Questions and TriviaQA datasets.
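
The data synthesis stage summarized in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name is hypothetical, the lexical-overlap scorer is only a toy stand-in for the ranker (the paper's relevance judgement is produced by prompting an LM, and BM25 is one of the retrievers it evaluates), the prompt wording is invented, and the short-context generator is assumed to be any caller-supplied function that maps a prompt string to text.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def split_into_chunks(text: str, chunk_words: int = 50) -> List[str]:
    """Split a long document into consecutive fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def overlap_score(question: str, chunk: str) -> int:
    """Toy relevance score (assumption, NOT the paper's ranker):
    number of chunk tokens that also occur in the question."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    c_counts = Counter(re.findall(r"\w+", chunk.lower()))
    return sum(c_counts[t] for t in q_terms)


def synthesize_example(
    question: str,
    document: str,
    generate: Callable[[str], str],  # hypothetical short-context generator
    top_k: int = 2,
    chunk_words: int = 50,
) -> Dict[str, str]:
    """Build one synthetic training pair: full long context -> CoT answer."""
    chunks = split_into_chunks(document, chunk_words)
    # Rank all chunks; sorted() is stable, so ties keep document order.
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    top_chunks = ranked[:top_k]
    # The generator only ever sees the short, retrieved context.
    prompt = (
        "\n\n".join(top_chunks)
        + f"\n\nQuestion: {question}\nThink step by step, then answer."
    )
    cot_answer = generate(prompt)
    # The fine-tuning example pairs the ORIGINAL long context with the answer.
    return {"input": f"{document}\n\nQuestion: {question}", "target": cot_answer}
```

The key design point the sketch tries to capture is the asymmetry between the two stages: the answer is generated from a few retrieved chunks, while the training input is the full, unfiltered document, so the fine-tuned model has to learn to locate the evidence itself.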