House of Cards: Massive Weights in LLMs

Jaehoon Oh, Seungjun Shin, Dokwan Oh

TL;DR

Massive activations bias LLMs by concentrating very large magnitudes in a few feature dimensions of the hidden state. The authors trace them to the intermediate state $\hat{h}^{inter}_{l}$ of an FFN module in an early layer, and specifically to the top-$k$ rows of ${\bm{W}}_{up}$ (and ${\bm{W}}_{gate}$) that produce its largest-magnitude dimensions. Zeroing these massive weights disrupts the model entirely, while zeroing all other weights causes only a relatively minor drop, which implies that pre-training concentrates learning in a tiny subset of weights. They propose MacDrop, a curriculum dropout that targets these massive weights during parameter-efficient fine-tuning (PEFT) to reduce dependence on them. Across zero-shot, long-context, and ablation experiments, MacDrop improves performance and robustness for many models, though it is not uniformly effective across architectures (e.g., Phi-3-medium and Gemma-2). The findings expose a weight-space structure in LLMs and offer a practical, plug-and-play tool for making PEFT more robust.

Abstract

Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through various experiments, including zero-shot downstream tasks, long-context tasks, and ablation studies, we demonstrate that MacDrop generally improves performance and strengthens robustness.
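
The two ideas in the abstract, selecting top-$k$ massive weights and applying curriculum dropout to them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a gated FFN whose intermediate state is SiLU(x W_gate^T) * (x W_up^T), weight matrices of shape (d_inter, d_model), a linear decay of the dropout probability, and element-wise zeroing without rescaling. The function names and the default starting probability are hypothetical.

```python
import numpy as np


def top_k_massive_rows(w_up, w_gate, x, k):
    """Indices of the k intermediate dimensions with the largest magnitude.

    Assumes a gated FFN: inter = SiLU(x @ w_gate.T) * (x @ w_up.T).
    The rows of w_up and w_gate at these indices play the role of the
    "massive weights" in this sketch.
    """
    gate = x @ w_gate.T
    silu = gate / (1.0 + np.exp(-gate))  # SiLU activation (an assumption)
    inter = silu * (x @ w_up.T)
    return np.argsort(-np.abs(inter))[:k]


def dropout_prob(step, total_steps, p_start=0.8):
    """Dropout probability that decays linearly from p_start to 0.

    The linear shape and the default p_start are hypothetical; the abstract
    only says the probability starts high and gradually decreases.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start * (1.0 - frac)


def macdrop_rows(w, rows, p, rng):
    """Copy of w in which each entry of the selected rows is zeroed with prob p.

    Rows outside `rows` are returned unchanged. No rescaling is applied
    (an assumption of this sketch).
    """
    out = w.copy()
    keep = rng.random((len(rows), w.shape[1])) >= p
    out[rows] = w[rows] * keep
    return out
```

In a fine-tuning loop, one would call `dropout_prob(step, total_steps)` at each step and use the masked copies of the pre-trained matrices in the forward pass, so the adapter learns under progressively weaker perturbation of the massive weights.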


Paper Structure

This paper contains 35 sections, 3 equations, 72 figures, 10 tables, 1 algorithm.

Figures (72)

  • Figure 1: Massive weights.
  • Figure 2: Examples of generated responses.
  • Figure 4: (Top) Magnitudes of the hidden state and (Bottom) attention scores after Softmax of Mistral-7B, according to the position of the bos token. The described hidden state is the output of layer 16 (i.e., $h_{16}$). The attention scores are calculated at layer 17 (i.e., after massive activations appear) and averaged across different heads.
  • Figure 5: Various states in early layers.
  • Figure 6: Hidden states.
  • ...and 67 more figures