Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui

TL;DR

This paper tackles the challenge of long-term context in decoder-only Transformer LLMs by introducing sentinel tokens <SR> that summarize the information within text chunks. By inserting <SR> at chunk boundaries and modifying the attention mask, the model can retrieve both local token information and chunk-level semantics during decoding, with LoRA-based fine-tuning enabling lightweight training. Empirical results on WikiText-2 demonstrate perplexity improvements across multiple model families, and out-of-domain evaluations on DocumentQA and summarization show robust gains, including notable improvements for smaller models. The approach offers a scalable, low-cost enhancement to long-context language modeling with practical impact for real-world tasks requiring chunk-wise reasoning and memory.

Abstract

Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer performance degradation when modeling long-term contexts because they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method that enables LLMs to take a deep breath, encouraging them to summarize the information contained within discrete text chunks. Specifically, we segment the text into multiple chunks and insert a special token <SR> at the end of each chunk. We then modify the attention mask to integrate the chunk's information into the corresponding <SR> token. This enables LLMs to interpret information not only from historical individual tokens but also from the <SR> token, which aggregates the chunk's semantic information. Experiments on language modeling and out-of-domain downstream tasks validate the superiority of our approach.
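The chunking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `insert_sentinels`, the fixed `chunk_size`, and the choice to close a trailing partial chunk with <SR> are all assumptions made for the example.

```python
from typing import List

SR = "<SR>"  # sentinel token string, as named in the abstract


def insert_sentinels(tokens: List[str], chunk_size: int) -> List[str]:
    """Split tokens into chunks and append SR after each chunk.

    Assumption: chunks have a fixed size, and a trailing partial
    chunk is also closed with a sentinel.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(tokens[start:start + chunk_size])  # the chunk itself
        out.append(SR)                                # its sentinel
    return out


example = insert_sentinels(["a", "b", "c", "d", "e"], chunk_size=2)
print(example)
```

With a chunk size of 2, five tokens yield three chunks, each followed by one sentinel. In practice the sentinel would also need to be registered as a special token in the tokenizer's vocabulary.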

Paper Structure

This paper contains 19 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The modified attention mask, where cell $(r, c)$ signifies whether token $r$ can attend to token $c$. $\texttt{chunk}_{2,1}$ represents the first token of the second chunk, with similar patterns for other chunks.
  • Figure 2: An example of DocumentQA illustrating the attention between each position in the question and the sentinel tokens, where the correct <SR> index is 2. A more detailed explanation is provided in Section \ref{sec:out_analysis}.
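The mask convention in the Figure 1 caption, where cell $(r, c)$ says whether token $r$ can attend to token $c$, can be made concrete with a small sketch. The exact pattern is defined by the figure itself, which is not reproduced here, so the rule below is only one plausible reading of the abstract: each <SR> position attends only to the tokens of its own chunk (and itself), while ordinary tokens keep standard causal attention and can therefore see earlier <SR> tokens. The function name `build_mask` is hypothetical.

```python
from typing import List


def build_mask(is_sentinel: List[bool]) -> List[List[bool]]:
    """Return mask[r][c] == True when position r may attend to position c.

    Assumed rule (not taken from the paper's figure):
      - a sentinel attends only to its own chunk and to itself;
      - any other token attends causally to every position <= r.
    """
    n = len(is_sentinel)
    mask = [[False] * n for _ in range(n)]
    chunk_start = 0  # index of the first token of the current chunk
    for r in range(n):
        if is_sentinel[r]:
            for c in range(chunk_start, r + 1):
                mask[r][c] = True
            chunk_start = r + 1  # the next chunk begins after the sentinel
        else:
            for c in range(r + 1):
                mask[r][c] = True
    return mask


# Two chunks of two tokens, each followed by a sentinel.
layout = [False, False, True, False, False, True]
mask = build_mask(layout)
for row in mask:
    print("".join("x" if cell else "." for cell in row))
```

Under this assumed rule, the second sentinel (row 5) cannot see the first chunk directly, so it has to summarize its own chunk, while the first token of the second chunk (row 3) can still read the first sentinel.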