Equipping Transformer with Random-Access Reading for Long-Context Understanding

Chenghao Yang, Zi Yang, Nan Hua

TL;DR

The paper tackles long-context understanding in transformers, where the two main obstacles are the quadratic cost of self-attention and poor length extrapolation after pretraining on short inputs. It introduces Random-Access Reading, a framework in which a data server applies a simple, confidence-driven policy to skip blocks of tokens during reading, with an optional memory module (Attendre) to preserve coherence across skipped segments. The approach is evaluated through pretraining, finetuning, and long-context question answering on datasets including the C4 corpus and TriviaQA, and it shows improved efficiency and performance, including near-sublinear complexity when the memory module is included. These results suggest that dynamic, query-driven reading strategies can substantially reduce computation for long-context tasks while allowing short-context models to be adapted effectively to long contexts, with practical impact on interactive LLM systems and long-document understanding.

Abstract

Long-context modeling presents a significant challenge for transformer-based large language models (LLMs) due to the quadratic complexity of the self-attention mechanism and issues with length extrapolation caused by pretraining exclusively on short inputs. Existing methods address computational complexity through techniques such as text chunking, the kernel approach, and structured attention, and tackle length extrapolation problems through positional encoding, continued pretraining, and data engineering. These approaches typically require $\textbf{sequential access}$ to the document, necessitating reading from the first to the last token. We contend that for goal-oriented reading of long documents, such sequential access is not necessary, and a proficiently trained model can learn to omit hundreds of less pertinent tokens. Inspired by human reading behaviors and existing empirical observations, we propose $\textbf{random access}$, a novel reading strategy that enables transformers to efficiently process long documents without examining every token. Experimental results from pretraining, fine-tuning, and inference phases validate the efficacy of our method.

Paper Structure (18 sections, 1 equation, 5 figures, 1 table)

This paper contains 18 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Illustration for our proposed random-access reading strategy for long-context modeling. In the traditional sequential access scenario (upper part of the figure), inputs are split into equal-sized chunks and fed to the model in sequential order. In contrast, we propose building an additional data server module (in the lower part of the figure) that takes relevant statistics from the model and decides which chunk it should read next.
  • Figure 2: Pretraining experiments on C4 (4k+) for models without the memory mechanism. We disable skipping at evaluation time for a fair comparison. An intermediate skipping rate performs best after sufficient training, which confirms the effectiveness of our method.
  • Figure 3: Finetuning experiments on C4 (4k+) for models pretrained on short text. For reference, we add the best performance achieved in pretraining and the performance of a randomly initialized model. We find that short-text pretraining makes the model generalize even worse than a randomly initialized model on long-context language modeling, but fine-tuning with skipping can adapt such checkpoints to perform even better than a checkpoint pretrained specifically for long contexts.
  • Figure 4: Average number of skipped tokens ("Average Skips") during C4 (4k+) pretraining without the memory module. Across the skipping settings ($K>0$), the model gradually learns to skip more tokens, indicating that it acquires a better understanding of the current long context and becomes more confident in making aggressive skipping decisions.
  • Figure 5: Long-context question answering results. $x \rightarrow y$ means we train the model with skipping rate $K_{\text{train}}=x$ and evaluate it with skipping rate $K_{\text{infer}}=y$. We find that adopting a more aggressive skipping strategy substantially improves model performance.
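
The reading loop described in the Figure 1 caption (the model reads a chunk, reports statistics to a data server, and the server decides which chunk to read next) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `DataServer` class, the `read` function, and the rule "skip `int(K * confidence)` tokens" are all assumptions made for the example, and the confidence function is a stand-in for whatever statistics the real model would report. The only property taken from the captions is that $K=0$ corresponds to no skipping and $K>0$ enables it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class DataServer:
    """Hypothetical data server: serves chunks and decides how far to skip."""

    tokens: Sequence[int]
    chunk_size: int
    skip_rate: int  # assumed role of K: scales how many tokens are skipped
    position: int = 0

    def has_next(self) -> bool:
        return self.position < len(self.tokens)

    def next_chunk(self) -> Sequence[int]:
        # Serve the next chunk starting at the current position.
        chunk = self.tokens[self.position:self.position + self.chunk_size]
        self.position += len(chunk)
        return chunk

    def report(self, confidence: float) -> int:
        # Assumed policy: skip int(K * confidence) tokens, with confidence
        # clamped to [0, 1] and the skip capped at the end of the document.
        confidence = min(max(confidence, 0.0), 1.0)
        skipped = int(self.skip_rate * confidence)
        skipped = min(skipped, len(self.tokens) - self.position)
        self.position += skipped
        return skipped


def read(
    server: DataServer,
    confidence_fn: Callable[[Sequence[int]], float],
) -> Tuple[List[List[int]], int]:
    """Read until the document is exhausted; return chunks read and tokens skipped."""
    chunks_read: List[List[int]] = []
    total_skipped = 0
    while server.has_next():
        chunk = server.next_chunk()
        chunks_read.append(list(chunk))
        # In a real system the confidence would come from the model itself.
        total_skipped += server.report(confidence_fn(chunk))
    return chunks_read, total_skipped


if __name__ == "__main__":
    document = list(range(20))
    # A fully confident reader with K=4 skips 4 tokens after every chunk.
    server = DataServer(tokens=document, chunk_size=4, skip_rate=4)
    chunks, skipped = read(server, lambda chunk: 1.0)
    print(len(chunks), skipped)
```

With `skip_rate=0` the loop degenerates to ordinary sequential access over equal-sized chunks, which matches the upper half of Figure 1; any positive value lets the (assumed) confidence signal shorten the read.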