Entropy-Guided Attention for Private LLMs

Nandan Kumar Jha, Brandon Reagen

TL;DR

The paper addresses the high latency and communication overhead of private inference for decoder-only LLMs by proposing an information-theoretic lens that treats nonlinearities through Shannon entropy. It reveals that nonlinearities play a dual role: preventing entropy collapse in deep layers and preventing entropic overload in early layers, which preserves attention head diversity. To enable PI-friendly architectures, the authors introduce entropy-guided attention, entropy regularization with per-head thresholds and a learnable temperature, and PI-friendly normalization options such as weight and spectral normalization along with FFN scaling. Empirical results show substantial PI-efficiency gains, including up to a 3.94× reduction in communication and a 1.72× latency improvement, with scalable benefits across model depths and context sizes, signaling a practical path toward efficient private LLM inference.
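The entropy regularization summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared-excess form of the penalty, and the omission of causal masking are all assumptions made here for brevity. It shows a temperature-scaled softmax, the mean Shannon entropy of one attention head, and a penalty that is nonzero only for heads whose entropy exceeds their per-head threshold.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax of scores / temperature, stabilized by subtracting the max."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def head_entropy(score_rows, temperature=1.0):
    """Mean entropy over the query rows of one attention head."""
    rows = [entropy(softmax(r, temperature)) for r in score_rows]
    return sum(rows) / len(rows)

def overload_penalty(heads, temperatures, thresholds):
    """Hypothetical regularizer (an assumption, not the paper's exact loss):
    mean squared excess of each head's entropy over its own threshold."""
    total = 0.0
    for rows, temp, thr in zip(heads, temperatures, thresholds):
        excess = max(0.0, head_entropy(rows, temp) - thr)
        total += excess ** 2
    return total / len(heads)

# Two toy heads with one query row over four keys each.
flat_head = [[0.0, 0.0, 0.0, 0.0]]    # uniform attention, maximal entropy
peaked_head = [[10.0, 0.0, 0.0, 0.0]]  # nearly one-hot attention
penalty = overload_penalty([flat_head, peaked_head], [1.0, 1.0], [1.0, 1.0])
```

In this toy setup only the flat head is penalized, and lowering a head's temperature sharpens its attention and reduces its entropy, which is the lever a learnable temperature would provide during training.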

Abstract

The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: entropy collapse in deeper layers that destabilizes training, and entropic overload in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm
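For reference, the Shannon entropy the abstract relies on can be written for a single attention row as below. The notation is an assumption introduced here, not taken from the paper: $a^{(h)}_{ij}$ is the softmax attention weight that query position $i$ in head $h$ places on key position $j$, and $T$ is the number of positions that row attends to.

```latex
H^{(h)}_i \;=\; -\sum_{j=1}^{T} a^{(h)}_{ij} \,\log a^{(h)}_{ij},
\qquad
0 \;\le\; H^{(h)}_i \;\le\; \log T .
```

The lower bound is reached when a row is one-hot and the upper bound when it is uniform. Read this way, entropy collapse corresponds to rows drifting toward the lower bound, while entropic overload corresponds to many heads sitting near the upper bound and therefore behaving alike.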

Paper Structure

This paper contains 13 sections, 13 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: An illustration of the threat model and cryptographic protocols used for LLM private inference.
  • Figure 2: (a) The fraction of attention heads distributed across different entropy ranges, and (b) evaluation loss for GPT-2 (small) models with reduced nonlinearities, when trained from scratch on the CodeParrot dataset.
  • Figure 3: Headwise entropy distribution in LLM architectures with reduced nonlinearities compared to baseline models. Yellow regions indicate high-entropy concentrations, revealing severe entropic overload predominantly in early layers.
  • Figure 4: Nonlinearity-reduced simplified architecture with entropy-guided attention mechanism.
  • Figure 5: Layerwise entropy patterns in GPT-2 models ($L$ = 12, $H$ = 12, $d$ = 768) trained from scratch on the CodeParrot dataset. Shown are (a) the baseline model, (b) the Softmax-only model without normalization, and variants with (c) weight normalization, (d) spectral normalization, and (e) scaled-FFN. While these normalization methods prevent entropy collapse, they fail to address entropic overload in early layers. Our final configuration (f) incorporates entropy regularization within scaled-FFN to effectively manage both issues.
  • ...and 3 more figures