First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

Chi Ma, Mincong Huang, Ying Zhang, Chao Wang, Yujie Wang, Lei Yu, Chuan Liu, Wei Lin

TL;DR

This work tackles the high inference cost of large language models by introducing a training-free Threshold-based Dynamic Activation (TDA) method that exploits sequence-level sparsity to accelerate generation by 18-25% with minimal accuracy loss. TDA applies offline thresholding to the prompt to generate per-layer activation masks, then reuses these masks during generation to skip underutilized neurons in the FFN, avoiding both retraining and reliance on ReLU-specific dynamics. The authors provide a theoretical framework for dynamic activation that identifies history-related activation uncertainty and semantic-irrelevant activation inertia as key drivers of sparsity. They validate TDA across multiple model families and tasks, reporting competitive or improved performance relative to training-dependent and training-free baselines. The approach offers practical, deployment-friendly speedups for diverse LLM architectures and provides insights that can guide future research in efficient model design, including adaptive depth and prompt compression strategies.

Abstract

Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation (TDA) method that leverages sequence information to exploit the inherent sparsity of models across various architectures. The method is designed to accelerate generation by 18-25% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our analyses not only provide a theoretical foundation for DA methods but also offer insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.
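
The mask-then-skip idea summarized in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation. The scoring rule (mean absolute activation over the prompt tokens), the `keep_ratio` parameter, the plain two-matrix FFN, and all function names are assumptions made for illustration; the paper's actual thresholding procedure may differ. SiLU is used only to show that nothing here depends on ReLU.

```python
import math
import random


def silu(z):
    """SiLU activation, a common non-ReLU choice in LLM feed-forward layers."""
    return z / (1.0 + math.exp(-z))


def hidden_activations(x, w_in):
    """Hidden activations silu(W_in x) for one token vector x."""
    return [silu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def build_mask(prompt, w_in, keep_ratio):
    """Return indices of neurons to keep, chosen from prompt activations.

    Assumption: a neuron's score is its mean |activation| over the prompt,
    and the threshold is set so that roughly `keep_ratio` of neurons survive.
    """
    n_neurons = len(w_in)
    scores = [0.0] * n_neurons
    for x in prompt:
        for j, h in enumerate(hidden_activations(x, w_in)):
            scores[j] += abs(h) / len(prompt)
    n_keep = max(1, int(round(keep_ratio * n_neurons)))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    return [j for j, s in enumerate(scores) if s >= threshold]


def ffn_masked(x, w_in, w_out, active):
    """FFN forward pass that computes only the neurons listed in `active`."""
    out = [0.0] * len(w_out)
    for j in active:
        h = silu(sum(w * xi for w, xi in zip(w_in[j], x)))
        for i in range(len(w_out)):
            out[i] += w_out[i][j] * h
    return out


# Toy layer: w_in is [n_neurons][d_model], w_out is [d_model][n_neurons].
random.seed(0)
d_model, n_neurons = 4, 16
w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
w_out = [[random.gauss(0, 1) for _ in range(n_neurons)] for _ in range(d_model)]

# Step 1: derive the mask once from the prompt tokens.
prompt = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(8)]
active = build_mask(prompt, w_in, keep_ratio=0.5)

# Step 2: reuse the same mask for every generated token.
new_token = [random.gauss(0, 1) for _ in range(d_model)]
sparse_out = ffn_masked(new_token, w_in, w_out, active)
dense_out = ffn_masked(new_token, w_in, w_out, range(n_neurons))
```

In this sketch the saving comes from `ffn_masked` never touching the rows of `w_in` or the columns of `w_out` that belong to masked neurons. A real implementation would gather those rows and columns into smaller dense matrices once per sequence rather than loop in Python.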

Paper Structure (25 sections, 20 equations, 11 figures, 5 tables, 1 algorithm)

This paper contains 25 sections, 20 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Training-Dependent DA
  • Figure 2: Training-Free TDA
  • Figure 3: Active pattern of 16 tokens separately
  • Figure 4: Active pattern of these 16 tokens as a sentence
  • Figure 5: Active pattern of 4 random tokens separately
  • ...and 6 more figures

Theorems & Definitions (11)

  • Claim 1
  • Definition 1
  • Definition 2
  • Proof 1
  • Claim 2
  • Definition 3
  • Proof 2
  • Claim 3
  • Definition 4
  • Proof 3
  • ...and 1 more