Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

Shenglai Zeng; Tianqi Zheng; Chuan Tian; Dante Everaert; Yau-Shian Wang; Yupin Huang; Michael J. Morais; Rohit Patki; Jinjin Tian; Xinnan Dai; Kai Guo; Monica Xiao Cheng; Hui Liu

TL;DR

This work tackles the challenge of personalizing LLMs under strict input-length constraints by introducing Attn-GS, an attention-guided context compression framework. It uses a white-box marking model to identify important personalization sentences from attention patterns, then uses a summarization model to generate a compact, task-relevant user profile within a fixed token budget. On MovieLens-1M and LaMP-5, Attn-GS consistently outperforms baselines in both inference-only and training-plus-inference settings, achieving performance close to full-context usage while reducing tokens by approximately $50\times$ and, in token-efficiency probes, by about $7\times$. The results highlight the value of internal LLM signals for personalization and show that fine-tuning the marking model sharpens its discrimination between important and unimportant signals, suggesting a practical path to deploying efficient, personalized LLMs in real-world applications.

Abstract

Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.
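
To make the marking step concrete, here is a minimal sketch of how token-level attention might be pooled into sentence-level scores and used to mark sentences for a downstream compression model. It is an illustration under stated assumptions, not the authors' implementation: mean pooling, top-k selection, the `<mark>` tag format, the function names, and the toy attention values are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def sentence_scores(token_attn: List[float],
                    spans: List[Tuple[int, int]]) -> List[float]:
    """Average token attention over each sentence.

    `spans` holds half-open [start, end) token ranges, one per sentence.
    Mean pooling is an assumption; other aggregations are possible.
    """
    scores = []
    for start, end in spans:
        window = token_attn[start:end]
        scores.append(sum(window) / len(window) if window else 0.0)
    return scores


def mark_top_k(sentences: List[str], scores: List[float], k: int) -> List[str]:
    """Wrap the k highest-scoring sentences in a hypothetical <mark> tag."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [f"<mark>{s}</mark>" if i in keep else s
            for i, s in enumerate(sentences)]


# Toy profile: three sentences of three tokens each (values are made up).
sentences = ["Rated Alien 5/5.", "Lives in Ohio.", "Rated Cars 2/5."]
token_attn = [0.30, 0.20, 0.10, 0.02, 0.03, 0.05, 0.15, 0.10, 0.05]
spans = [(0, 3), (3, 6), (6, 9)]

scores = sentence_scores(token_attn, spans)
marked = mark_top_k(sentences, scores, k=2)
print(marked)
```

In a setup like the one the abstract describes, the marked profile would then be passed to the compression model together with the token budget, so that the summary favors the marked sentences.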

Paper Structure (39 sections, 6 equations, 18 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Personalized LLMs
Utilization of LLMs' internal signals
Preliminary Studies
Problem Description & Notations
Token-level attention scores
Sentence-level attention scores
Signal-level attention scores
Key Findings
Datasets
Experimental Setup
Results & findings
Method
Critical Personalization Sentence Marking
...and 24 more sections

Figures (18)

Figure 1: Personalized LLMs.
Figure 2: Attention visualization on MovieLens dataset. Layer-6 attention and cross-layer attention for non-fine-tuned $\Phi_{\text{Mark}}$ (a-b) and fine-tuned $\Phi_{\text{Mark}}$ (c-d), and performance comparison across signal subsets (e). U: User Basic Info, T: Title, R: User Rating, RT: Rating Time, Y: Movie Year, G: Genre, S: Movie Summary, All: All signals.
Figure 3: Attention visualization on LaMP-5 dataset. Layer-12 attention and cross-layer attention for non-fine-tuned $\Phi_{\text{Mark}}$ (a-b) and fine-tuned $\Phi_{\text{Mark}}$ (c-d), and performance comparison across signal subsets (e).
Figure 4: An illustration of Attn-GS framework.
Figure 5: Ablation Studies
...and 13 more figures
