
AERO: Entropy-Guided Framework for Private LLM Inference

Nandan Kumar Jha, Brandon Reagen

TL;DR

AERO addresses the challenge of private LLM inference by reducing costly nonlinearities through an entropy-guided framework. It combines an inference-time LayerNorm substitute with an adaptive, per-head entropy regularizer that uses learnable thresholds and a tolerance margin to prevent entropic overload while preserving head diversity. Empirical results show substantial communication and latency savings (≈3.4× and 1.4×, respectively) with no degradation in perplexity, and improvements up to 6–8% in the most constrained Softmax-only settings. This approach provides practical gains for privacy-preserving inference and offers a principled design path for scalable, normalization-free LLM architectures.

Abstract

Privacy-preserving computation enables language model inference directly on encrypted data, yet it suffers from prohibitive latency and communication overheads, primarily due to nonlinear functions. Removing nonlinearities, however, can trigger one of two failure modes that limit how far this removal can go: entropy collapse in deeper layers, which destabilizes training, and entropic overload in early layers, which causes under-utilization of attention heads. To address these challenges, we introduce AERO, an entropy-guided framework that strategically eliminates costly nonlinear operations from transformer architectures. AERO employs adaptive recalibration through a head-wise entropy regularizer with learnable per-head strengths, enabling each head to adjust its entropy level while penalizing extreme entropies and fostering functional diversity through a tolerance margin. Experiments show AERO can save 3.4$\times$ communication and 1.4$\times$ latency, without any performance penalty.
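
To make the idea of a head-wise entropy regularizer with a tolerance margin concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the squared hinge penalty and the names `thresholds`, `strengths`, and `margin` are assumptions chosen for illustration. In training, the per-head values would be learnable parameters; here they are fixed numbers.

```python
import math

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention row; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def head_entropy(attn):
    """Mean entropy over the query rows of one head's attention matrix."""
    return sum(row_entropy(row) for row in attn) / len(attn)

def entropy_penalty(heads, thresholds, strengths, margin):
    """Illustrative penalty (assumed form, not taken from the paper).

    Each head is penalized only when its entropy deviates from its own
    threshold by more than `margin`, so heads are free to differ within
    the tolerance band.
    """
    total = 0.0
    for attn, thr, lam in zip(heads, thresholds, strengths):
        deviation = abs(head_entropy(attn) - thr)
        excess = max(0.0, deviation - margin)  # zero inside the tolerance band
        total += lam * excess ** 2
    return total

# Two toy heads over 4 keys: one fully diffuse, one fully peaked.
uniform_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
one_hot_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

penalty = entropy_penalty(
    heads=[uniform_head, one_hot_head],
    thresholds=[0.7, 0.7],   # hypothetical per-head target entropies
    strengths=[1.0, 1.0],    # hypothetical per-head regularization strengths
    margin=0.1,              # hypothetical tolerance margin
)
```

In this toy setup the diffuse head sits at the maximum entropy log 4 and the peaked head at zero, so both fall outside the tolerance band and both contribute to the penalty. A head whose entropy lies within `margin` of its threshold contributes nothing.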

Paper Structure

This paper contains 32 sections, 9 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Latency and communication savings through nonlinearity reduction, and performance improvement through entropy regularization when AERO is applied on GPT-2 (125M), trained from scratch on CodeParrot dataset (detailed breakdown in Table \ref{tab:GPT2CLen128}).
  • Figure 2: An illustration of threat model for private LLM inference.
  • Figure 3: Distribution of attention heads (%) across entropy ranges for different model configurations (Table \ref{tab:ArchConfigGPT2}) trained from scratch, showing the concentration of heads in specific entropy intervals.
  • Figure 4: Entropy heatmaps of GPT-2 with GELU and ReLU in the FFN (a, b) and their normalization-free variants (c, d). Without LayerNorm, GELU causes significantly higher entropic overload.
  • Figure 5:
  • ...and 11 more figures