SirLLM: Streaming Infinite Retentive LLM

Yao Yao; Zuchao Li; Hai Zhao

TL;DR

SirLLM tackles the challenge of infinite input lengths in LLMs by introducing a token-entropy–driven memory strategy that selectively preserves high-information tokens in the KV-cache, coupled with a decay mechanism to keep memories fresh. By defining token entropy as $e_i = -\log P(x_i | x_0, ..., x_{i-1})$, the method retains key tokens while discarding less informative ones, enabling long-running conversations without model fine-tuning. Across three tailored tasks—DailyDialog, Grocery Shopping, and Rock-Paper-Scissors—SirLLM demonstrates robust, consistent improvements across multiple LLMs compared to StreamLLM and other baselines, proving enhanced long-term memory and stable performance in streaming contexts. The results suggest practical benefits for real-world applications requiring long-context dialogue, with code available for replication and further exploration, although adaptive decay and improved memory relevance remain open avenues for future work.
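The token entropy defined above can be illustrated with a minimal sketch. It assumes the conditional probabilities $P(x_i | x_0, ..., x_{i-1})$ have already been read off the model's output distribution; the function name `token_entropies` and the sample values are hypothetical and are not taken from the SirLLM code base.

```python
import math


def token_entropies(probs):
    """Return e_i = -log P(x_i | x_0, ..., x_{i-1}) for each token.

    `probs` holds the probability the model assigned to each observed
    token given its prefix (assumed to be precomputed elsewhere).
    """
    entropies = []
    for p in probs:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        entropies.append(-math.log(p))
    return entropies


# Hypothetical probabilities: a predictable token (0.9) and a surprising one (0.01).
example = token_entropies([0.9, 0.01])
```

Under this definition a token the model found surprising (low probability) receives a high entropy, which is why it is treated as carrying more information and is a better candidate for retention.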

Abstract

As Large Language Models (LLMs) become increasingly prevalent in various domains, their ability to process inputs of any length and maintain a degree of memory becomes essential. However, the one-off input of overly long texts is limited, as studies have shown that when input lengths exceed the LLMs' pre-trained text length, there is a dramatic decline in text generation capabilities. Moreover, simply extending the length of pre-training texts is impractical due to the difficulty in obtaining long text data and the substantial memory consumption costs this would entail for LLMs. Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs, but this approach can significantly impair the model's long-term memory capabilities. Motivated by this challenge, we introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues without the need for fine-tuning. SirLLM utilizes the Token Entropy metric and a memory decay mechanism to filter key phrases, endowing LLMs with both long-lasting and flexible memory. We designed three distinct tasks and constructed three datasets to measure the effectiveness of SirLLM from various angles: (1) DailyDialog; (2) Grocery Shopping; (3) Rock-Paper-Scissors. Our experimental results robustly demonstrate that SirLLM can achieve stable and significant improvements across different LLMs and tasks, compellingly proving its effectiveness. When having a conversation, "A sir could forget himself," but SirLLM never does! Our code is publicly available at https://github.com/Zoeyyao27/SirLLM

Paper Structure (32 sections, 7 equations, 9 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Token Entropy
Does higher token entropy equate to increased LLM focus?
Streaming Infinite Retentive LLM
Experiments
Experimental Setup
Baselines
StreamLLM:
RandomLLM:
IntervalLLM:
Results
DailyDialog
...and 17 more sections

Figures (9)

Figure 1: The visualization of SirLLM versus existing attention patterns.
Figure 2: Attention sink phenomenon (Xiao et al., 2023). We visualize the average layer attention logits over 256 sentences, each with a length of 20, in Vicuna-7b-v1.3. In the shallow layers, a significant amount of the attention score is dedicated to the first tokens, while in the final layer the model focuses more on the recent tokens.
Figure 3: Scatter Plot of the average attention weights over 256 sentences at every layer. We divide the tokens into four segments based on token entropy, with segment 1 having the lowest entropy and segment 4 the highest. Mean Weights stands for the average attention weights across all layers. Mean Rank denotes the average ranking of each segment at every layer. Mean 1st proportion denotes the proportion of times each segment ranked first across all layers. The figure indicates that as token entropy increases, so does the attention that the LLM allocates to that token.
Figure 4: Framework overview of SirLLM. When the number of tokens stored in the KV cache exceeds the pre-training length $L$, SirLLM calculates the entropy of each token and selects the tokens with higher token entropy, thereby conserving space in the KV cache.
Figure 5: The perplexity of language modeling on 20K token text. The Sliding-window's PPL escalates dramatically once the token length exceeds the pre-trained length. In contrast, both SirLLM and StreamLLM, which incorporate attention sink tokens, show stable performance. SirLLM and StreamLLM's performances are almost identical, effectively demonstrating that SirLLM's memory mechanism does not impair the model's answering performance and can indeed reinforce the model's memory capabilities.
...and 4 more figures
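
The selection step described in the Figure 4 caption, together with the attention sink tokens mentioned in Figure 5 and the decay mechanism from the abstract, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `n_sink` default, the multiplicative form of the decay, and the choice to trim back to exactly `max_len` entries are all hypothetical.

```python
def decay_entropies(entropies, ratio):
    """Scale stored entropies by `ratio` (assumed to lie in (0, 1)).

    Assumption: decay is multiplicative, so older memories gradually
    lose priority relative to newly added tokens.
    """
    return [e * ratio for e in entropies]


def select_cache_indices(entropies, max_len, n_sink=4):
    """Return the cache positions to keep, in their original order.

    entropies: one token-entropy value per cached token.
    max_len:   cache budget, standing in for the pre-training length L.
    n_sink:    number of leading attention sink tokens always kept
               (hypothetical default).
    """
    n = len(entropies)
    if n <= max_len:
        return list(range(n))  # cache still fits, nothing to evict

    sinks = list(range(min(n_sink, max_len)))
    budget = max_len - len(sinks)

    # Rank the remaining positions by entropy and keep the highest ones.
    candidates = range(len(sinks), n)
    top = sorted(candidates, key=lambda i: entropies[i], reverse=True)[:budget]

    # Restore positional order so the kept tokens stay in sequence.
    return sinks + sorted(top)
```

For example, with entropies `[0.1, 0.2, 5.0, 0.3, 4.0, 0.05, 3.0]`, a budget of 4, and one sink token, the sketch keeps position 0 plus the three highest-entropy positions 2, 4, and 6. Applying `decay_entropies` between dialogue rounds lowers the scores of older entries, so tokens from recent rounds can displace them at the next selection.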
