Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization

Shangding Gu

TL;DR

This work introduces PAPerBench, a large-scale benchmark that jointly evaluates personalization quality and privacy protection in LLMs across long-context inputs ranging from 1K to 256K tokens. Through extensive experiments on state-of-the-art models, it reveals a consistent scaling gap where both personalization and privacy degrade with longer contexts, with model capacity moderating but not eliminating the effect. A core theoretical contribution shows that softmax attention in fixed-capacity Transformers dilutes signals from sparse, task-relevant tokens as context grows, causing representation-level information loss and misalignment with user intent and privacy constraints. The findings highlight fundamental limitations of simply scaling context lengths and motivate new architectures and mechanisms to robustly support long-horizon privacy and personalization in real-world deployments. PAPerBench thus provides a reproducible framework for diagnosing and advancing scalable privacy-preserving personalization.

Abstract

Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models -- long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench

Paper Structure (44 sections, 2 theorems, 19 equations, 7 figures, 6 tables)

This paper contains 44 sections, 2 theorems, 19 equations, 7 figures, and 6 tables.

Key Result

Theorem 6.1

Consider a Transformer layer with (single-head) attention. Let the context $C_n=\{x_1,\ldots,x_n\}$ contain a task-relevant subset $R\subseteq[n]$ with $|R|=m$, where $m$ is fixed and independent of $n$. Denote the remaining indices by $N=[n]\setminus R$. Assume: ... Define the total attention mass assigned to relevant tokens as ... Then, as $n\to\infty$, ...
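The theorem's assumptions and conclusion are truncated in this excerpt, so the following is only a minimal numerical sketch of the dilution effect described in the abstract, not the paper's formal statement. It assumes a toy setting that is not taken from the source: each of the $m$ relevant tokens gets attention logit `gap`, every other token gets logit 0, and `gap` stays fixed as $n$ grows. The function name and parameters are hypothetical.

```python
import math


def relevant_attention_mass(n, m=4, gap=3.0):
    """Softmax attention mass on m relevant tokens out of n total tokens.

    Toy assumption (not from the source): relevant tokens have logit `gap`,
    all other tokens have logit 0, and `gap` does not grow with n.
    """
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    relevant = m * math.exp(gap)   # unnormalized weight of the relevant set R
    irrelevant = float(n - m)      # exp(0) = 1 per irrelevant token
    return relevant / (relevant + irrelevant)


if __name__ == "__main__":
    # Context lengths loosely mirroring the benchmark's 1K to 256K range.
    for n in (1_000, 8_000, 64_000, 256_000):
        print(f"n={n:>7}  mass on relevant tokens={relevant_attention_mass(n):.6f}")
```

Under this toy assumption the mass equals $m e^{\text{gap}} / (m e^{\text{gap}} + n - m)$, which shrinks roughly like $1/n$: with $m$ and the logit advantage fixed, the normalizer grows with every added token while the numerator does not.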

Figures (7)

  • Figure 1: We study how LLMs perform personalization and privacy reasoning from user background contexts of varying lengths.
  • Figure 2: Qwen3-235B (no-decoy, sparse, 64k): Accuracy drops sharply as the minimum number of involved categories increases ($k=2\rightarrow 3\rightarrow 4$), indicating that multi-category privacy reasoning becomes substantially harder with greater categorical complexity.
  • Figure 3: Privacy performance of Qwen3-235B across increasing context lengths, comparing decoy and no-decoy information settings. Decoy injection consistently reduces privacy accuracy, while both settings degrade under long contexts.
  • Figure 4: Privacy performance of Qwen3-235B with decoy injection under sparse and non-sparse privacy information context settings across context lengths. Sparse privacy information contexts consistently yield lower accuracy, indicating increased difficulty when privacy cues are sparse.
  • Figure 5: Privacy accuracy consistently decreases as context length increases. From 1k to 128k, both GPT-5.2 and Llama-4-Scout-109B are evaluated on the same set of 1,812 questions and exhibit a clear downward trend as context length grows. Due to high cost, the 256k results are evaluated on a reduced subset of 348 questions; they continue this trend and are included for qualitative comparison.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Theorem 6.1: Attention Dilution under Context Scaling
  • Proof: Proof sketch
  • Remark 6.2
  • Corollary 6.3: Unified Long-Context Performance Degradation
  • Proof: Proof sketch
  • Definition 1.1: Near-Miss Personalization Option