Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

Weiyao Luo; Suncong Zheng; Heming Xia; Weikang Wang; Yan Lei; Tianyu Liu; Shuang Chen; Zhifang Sui

TL;DR

This paper tackles the challenge of long-term context in decoder-only Transformer LLMs by introducing sentinel tokens <SR> that summarize the information within text chunks. By inserting <SR> at chunk boundaries and modifying the attention mask, the model can retrieve both local token information and chunk-level semantics during decoding, with LoRA-based fine-tuning enabling lightweight training. Empirical results on WikiText-2 demonstrate perplexity improvements across multiple model families, and out-of-domain evaluations on DocumentQA and summarization show robust gains, including notable improvements for smaller models. The approach offers a scalable, low-cost enhancement to long-context language modeling with practical impact for real-world tasks requiring chunk-wise reasoning and memory.

Abstract

Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer performance degradation when modeling long-term contexts because they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method that enables LLMs to take a deep breath, encouraging them to summarize the information contained within discrete text chunks. Specifically, we segment the text into multiple chunks and insert a special token <SR> at the end of each chunk. We then modify the attention mask to integrate the chunk's information into the corresponding <SR> token. This allows LLMs to draw on information not only from individual historical tokens but also from the <SR> tokens, which aggregate each chunk's semantic information. Experiments on language modeling and out-of-domain downstream tasks validate the superiority of our approach.
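The chunk-and-insert step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `insert_sentinels`, the `chunk_len` parameter, the use of fixed-length chunks, and the choice to close a final partial chunk with <SR> are all assumptions made for the example.

```python
from typing import List

SR = "<SR>"  # sentinel token string as written in the abstract


def insert_sentinels(tokens: List[str], chunk_len: int) -> List[str]:
    """Split `tokens` into chunks of at most `chunk_len` and append SR to each.

    ASSUMPTION: chunks have a fixed length and a trailing partial chunk
    also receives a sentinel; the abstract does not specify either detail.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), chunk_len):
        out.extend(tokens[start:start + chunk_len])  # tokens of this chunk
        out.append(SR)                               # sentinel closes the chunk
    return out


if __name__ == "__main__":
    print(insert_sentinels("the cat sat on the mat".split(), chunk_len=4))
```

In a real pipeline the sentinel would be a new vocabulary entry operating on token ids rather than strings; strings are used here only to keep the sketch readable.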

Paper Structure (19 sections, 2 figures, 3 tables)

Introduction
Related Work
Attention Mask
Context Distillation
Approach
Adding Sentinel Tokens
Adapting Model Inputs for Sentinel Integration
Experiments
Models and Data
Experimental Setup
Results
Analysis
Breath Length Analysis
Generalization of Model Performance on Out-of-Domain Data
...and 4 more sections

Figures (2)

Figure 1: The modified attention mask, where cell $(r, c)$ signifies whether token $r$ can attend to token $c$. $\texttt{chunk}_{2,1}$ represents the first token of the second chunk, with similar patterns for other chunks.
Figure 2: An example of DocumentQA illustrating the attention between each position in the question and the sentinel tokens, where the correct <SR> index is 2. A more detailed explanation is provided in Section \ref{sec:out_analysis}.
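The $(r, c)$ convention from the Figure 1 caption can be made concrete with a small sketch. The caption does not state which cells are masked, so the rule used here is only one plausible reading and should not be taken as the authors' mask: ordinary tokens attend causally to every earlier position, while each <SR> row is restricted to the tokens of its own chunk. The function name `build_mask` is hypothetical.

```python
from typing import List

SR = "<SR>"  # sentinel token string


def build_mask(tokens: List[str]) -> List[List[bool]]:
    """Return mask where mask[r][c] is True when token r may attend to token c."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    chunk_start = 0  # index of the first token of the current chunk
    for r, tok in enumerate(tokens):
        # ASSUMPTION: a sentinel sees only its own chunk (and itself);
        # every other token uses the standard causal pattern.
        first = chunk_start if tok == SR else 0
        for c in range(first, r + 1):
            mask[r][c] = True
        if tok == SR:
            chunk_start = r + 1  # the next chunk begins after the sentinel
    return mask


if __name__ == "__main__":
    seq = ["a", "b", SR, "c", "d", SR]
    for tok, row in zip(seq, build_mask(seq)):
        print(f"{tok:>4} ", "".join("1" if cell else "0" for cell in row))
```

Under this reading, later tokens can still reach earlier chunks both directly and through the earlier <SR> positions, which matches the abstract's statement that the model uses individual historical tokens as well as the sentinels.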
