AERO: Entropy-Guided Framework for Private LLM Inference
Nandan Kumar Jha, Brandon Reagen
TL;DR
AERO addresses the challenge of private LLM inference by reducing costly nonlinearities through an entropy-guided framework. It combines an inference-time LayerNorm substitute with an adaptive, per-head entropy regularizer that uses learnable thresholds and a tolerance margin to prevent entropic overload while preserving head diversity. Empirical results show substantial communication and latency savings (≈3.4× and 1.4×, respectively) with no degradation in perplexity, and improvements up to 6–8% in the most constrained Softmax-only settings. This approach provides practical gains for privacy-preserving inference and offers a principled design path for scalable, normalization-free LLM architectures.
Abstract
Privacy-preserving computation enables language model inference directly on encrypted data, yet it suffers from prohibitive latency and communication overheads, primarily due to nonlinear functions. Removing nonlinearities, however, can trigger one of two failure modes that limit how far this removal can go: entropy collapse in deeper layers, which destabilizes training, and entropic overload in early layers, which leaves attention heads under-utilized. To address these challenges, we introduce AERO, an entropy-guided framework that strategically eliminates costly nonlinear operations from transformer architectures. AERO applies adaptive recalibration through a head-wise entropy regularizer with learnable per-head strengths, which lets each head adjust its own entropy level, penalizes extreme entropies, and fosters functional diversity through a tolerance margin. Experiments show that AERO reduces communication by 3.4$\times$ and latency by 1.4$\times$, without any performance penalty.
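To make the idea of a head-wise entropy regularizer with a tolerance margin concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact functional form, so the squared-excess penalty, the function names, and the `target_frac` and `margin_frac` parameters are all assumptions made for illustration. The per-head strengths, which would be learnable in training, are plain floats here.

```python
import math

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention row; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def head_entropy(attn):
    """Mean row entropy of one head's attention matrix (a list of probability rows)."""
    return sum(row_entropy(row) for row in attn) / len(attn)

def entropy_penalty(heads, strengths, target_frac=0.5, margin_frac=0.1):
    """Hypothetical head-wise regularizer (assumed form, not taken from the paper).

    Each head is penalized only when its entropy deviates from a target by more
    than a tolerance margin, so heads are free to differ inside the margin.
    target_frac and margin_frac are fractions of the maximum entropy log(T),
    where T is the context length. strengths holds one weight per head.
    """
    context_len = len(heads[0][0])
    max_entropy = math.log(context_len)
    target = target_frac * max_entropy
    margin = margin_frac * max_entropy
    total = 0.0
    for attn, strength in zip(heads, strengths):
        deviation = abs(head_entropy(attn) - target)
        excess = max(0.0, deviation - margin)  # zero inside the tolerance band
        total += strength * excess ** 2
    return total / len(heads)

# Three toy heads over a context of length 4.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 4    # maximum entropy
focused = [[0.5, 0.5, 0.0, 0.0]] * 4        # entropy log(2), equal to the target here
collapsed = [[1.0, 0.0, 0.0, 0.0]] * 4      # zero entropy

penalty = entropy_penalty([uniform, focused, collapsed], [1.0, 1.0, 1.0])
```

In this sketch the near-uniform head and the one-hot head are both penalized, while the head whose entropy sits inside the tolerance band contributes nothing. That is one way to discourage extreme entropies without forcing every head to the same value.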
