First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models
Chi Ma, Mincong Huang, Ying Zhang, Chao Wang, Yujie Wang, Lei Yu, Chuan Liu, Wei Lin
TL;DR
This work tackles the high inference cost of large language models by introducing a training-free Threshold-based Dynamic Activation (TDA) method that exploits sequence-level sparsity to accelerate generation by 18-25% with minimal accuracy loss. TDA uses offline thresholding to generate per-layer activation masks from the prompt, then applies these masks during generation to skip underutilized neurons in the FFN, avoiding retraining or reliance on ReLU-specific dynamics. The authors provide a theoretical framework for dynamic activation, identifying history-related activation uncertainty and semantic-irrelevant activation inertia as key drivers, and validate TDA across multiple model families and tasks, showing competitive or improved performance relative to training-dependent and training-free baselines. The approach offers practical, deployment-friendly speedups for diverse LLM architectures and provides insights that can guide future research in efficient model design, including adaptive depth and prompt compression strategies.
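The mask-then-skip idea described above can be illustrated with a minimal sketch for a single FFN layer. It is not the authors' implementation: the function names (`build_mask`, `ffn_forward`), the SiLU activation, the mean-absolute-activation statistic, and the threshold value are all assumptions chosen for illustration, and the toy weights are made up.

```python
import math

def silu(z):
    # Non-ReLU activation, used here only as an example (assumption).
    return z / (1.0 + math.exp(-z))

def ffn_activations(x, w_up):
    """Hidden activations of one FFN layer: act(W_up @ x), one value per neuron."""
    return [silu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]

def build_mask(prompt_states, w_up, threshold):
    """Return indices of neurons whose mean |activation| over the prompt
    reaches the threshold (hypothetical selection rule)."""
    totals = [0.0] * len(w_up)
    for x in prompt_states:
        for j, a in enumerate(ffn_activations(x, w_up)):
            totals[j] += abs(a)
    n_tokens = len(prompt_states)
    return [j for j, t in enumerate(totals) if t / n_tokens >= threshold]

def ffn_forward(x, w_up, w_down, kept=None):
    """FFN output for one token. If `kept` is given, only those neurons are
    evaluated; all other neurons are skipped entirely."""
    neurons = range(len(w_up)) if kept is None else kept
    out = [0.0] * len(w_down[0])
    for j in neurons:
        a = silu(sum(w * xi for w, xi in zip(w_up[j], x)))
        for k, w in enumerate(w_down[j]):
            out[k] += a * w
    return out

# Toy layer: 2 model dimensions, 4 FFN neurons (two of them barely active).
w_up = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.0], [0.0, -0.01]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
prompt_states = [[1.0, 2.0], [2.0, 1.0]]

kept = build_mask(prompt_states, w_up, threshold=0.1)  # mask built once from the prompt
new_token = [1.5, 0.5]
dense = ffn_forward(new_token, w_up, w_down)           # all neurons
sparse = ffn_forward(new_token, w_up, w_down, kept)    # masked neurons only
```

In this sketch the mask is computed once from the prompt and reused for every generated token, so the per-token cost scales with the number of kept neurons rather than the full FFN width.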
Abstract
Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation (TDA) method that leverages sequence information to exploit the inherent sparsity of models across various architectures. The method is designed to accelerate generation by 18-25% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our comprehensive analyses not only provide a robust theoretical foundation for DA methods but also offer valuable insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.
