Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
Shangding Gu
TL;DR
This work introduces PAPerBench, a large-scale benchmark that jointly evaluates personalization quality and privacy protection in LLMs across long-context inputs ranging from 1K to 256K tokens. Through extensive experiments on state-of-the-art models, it reveals a consistent scaling gap where both personalization and privacy degrade with longer contexts, with model capacity moderating but not eliminating the effect. A core theoretical contribution shows that softmax attention in fixed-capacity Transformers dilutes signals from sparse, task-relevant tokens as context grows, causing representation-level information loss and misalignment with user intent and privacy constraints. The findings highlight fundamental limitations of simply scaling context lengths and motivate new architectures and mechanisms to robustly support long-horizon privacy and personalization in real-world deployments. PAPerBench thus provides a reproducible framework for diagnosing and advancing scalable privacy-preserving personalization.
Abstract
Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models: long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench
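
The attention-dilution argument can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it assumes a single softmax attention head in which a fixed number of task-relevant tokens score a constant logit gap above all other tokens, which score 0. The function name `relevant_attention_mass`, the parameters `n_relevant` and `logit_gap`, and their default values are hypothetical choices made for illustration, and the context lengths only approximate the 1K to 256K range described above.

```python
import math


def relevant_attention_mass(n_tokens, n_relevant=4, logit_gap=5.0):
    """Softmax attention mass on the relevant tokens in a toy setting.

    Assumption (not from the paper): relevant tokens have logit `logit_gap`,
    every other token has logit 0, so the mass on relevant tokens is
    k * exp(gap) / (k * exp(gap) + (n - k)).
    """
    if not 0 < n_relevant <= n_tokens:
        raise ValueError("need 0 < n_relevant <= n_tokens")
    relevant = n_relevant * math.exp(logit_gap)
    distractors = float(n_tokens - n_relevant)  # exp(0) == 1 per distractor
    return relevant / (relevant + distractors)


# Context lengths roughly spanning the benchmark's 1K to 256K range.
context_lengths = [1_000, 4_000, 16_000, 64_000, 256_000]
masses = [relevant_attention_mass(n) for n in context_lengths]

for n, mass in zip(context_lengths, masses):
    print(f"{n:>7} tokens: mass on relevant tokens = {mass:.4f}")
```

Under these assumptions the logit gap stays fixed while the number of distractors grows, so the share of attention reaching the relevant tokens shrinks roughly in proportion to 1/n. A real model can learn larger gaps or use multiple heads, so this shows only the direction of the effect, not its magnitude.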
