The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad, Jin Hwa Lee, Wes Gurnee, Max Tegmark
TL;DR
The paper investigates how decoder-only LLMs remain robust when layers are deleted or adjacent layers are swapped at inference time, and finds that the effect depends strongly on depth. It proposes a four-stage inference framework (detokenization, feature engineering, prediction ensembling, and residual sharpening), hypothesized to hold across model families, for interpreting depth-wise computation. Using layer interventions and targeted probes (e.g., WiC probing, logit lens, CKA analyses), it shows that early and late layers are the most sensitive while middle layers are notably resilient, a picture supported by neuron-level analyses of prediction and suppression ensembles. The work offers a cohesive perspective on how redundancy and residual pathways enable self-repair and ensembling, with implications for interpretability, auditing, and robust model design across transformer architectures.
Abstract
We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning. We find that performance degradation is not uniform across layers: interventions to the early and final layers cause the most degradation, while the model is remarkably robust to dropping middle layers. This pattern of localized sensitivity motivates our hypothesis of four stages of inference, observed across diverse model families and sizes: (1) detokenization, where local context is integrated to lift raw token embeddings into higher-level representations; (2) feature engineering, where task- and entity-specific features are iteratively refined; (3) prediction ensembling, where hidden states are aggregated into plausible next-token predictions; and (4) residual sharpening, where irrelevant features are suppressed to finalize the output distribution. Synthesizing behavioral and mechanistic evidence, we provide a framework for interpreting depth-dependent computations in LLMs.
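The two interventions named in the abstract, deleting a layer and swapping two adjacent layers, can be illustrated on a toy residual stack. The following is a minimal sketch and not the authors' code: the names `make_toy_layer`, `run_residual_stack`, `delete_layer`, and `swap_adjacent` are hypothetical, and each "layer" is a simple affine update standing in for a transformer block. It only assumes that every layer adds its output back into a shared residual stream.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]  # maps the hidden state to a residual update


def make_toy_layer(scale: float, bias: float) -> Layer:
    """Hypothetical stand-in for a transformer block: update = scale * h + bias."""
    def layer(h: Vector) -> Vector:
        return [scale * v + bias for v in h]
    return layer


def run_residual_stack(layers: Sequence[Layer], h: Vector) -> Vector:
    """Apply each layer in order, adding its update to the residual stream."""
    for layer in layers:
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def delete_layer(layers: Sequence[Layer], i: int) -> List[Layer]:
    """Return a new stack with layer i removed; the input stack is untouched."""
    if not 0 <= i < len(layers):
        raise IndexError("layer index out of range")
    return list(layers[:i]) + list(layers[i + 1:])


def swap_adjacent(layers: Sequence[Layer], i: int) -> List[Layer]:
    """Return a new stack with layers i and i + 1 exchanged."""
    if not 0 <= i < len(layers) - 1:
        raise IndexError("need a valid pair (i, i + 1)")
    out = list(layers)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


if __name__ == "__main__":
    stack = [make_toy_layer(0.5, 1.0), make_toy_layer(1.0, 0.0), make_toy_layer(0.0, 2.0)]
    h0 = [1.0]
    print("baseline:", run_residual_stack(stack, h0))
    print("delete layer 1:", run_residual_stack(delete_layer(stack, 1), h0))
    print("swap layers 0 and 1:", run_residual_stack(swap_adjacent(stack, 0), h0))
```

Because every layer reads from and writes to the same residual stream, the modified stack still runs end to end without retraining; in a real model one would compare top-1 predictions of the intervened model against the baseline at each depth rather than raw hidden states.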
