
GAIN: Multiplicative Modulation for Domain Adaptation

Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang

Abstract

Adapting LLMs to new domains causes forgetting because standard methods (full fine-tuning, LoRA) inject new directions into the weight space. We propose GAIN, which re-emphasizes existing features through multiplicative modulation W_new = S * W. The learned diagonal matrix S is applied to the attention output projection and optionally the FFN. The principle mirrors gain modulation in neuroscience, where neurons adapt to context by scaling response strength while preserving selectivity. We evaluate GAIN on five models from four families (774M to 70B), adapting sequentially across eight domains. GAIN-FFN matches LoRA's in-domain adaptation, but their effects on previously trained domains are opposite: GAIN-FFN improves them by 7-13% (validation PPL), while LoRA degrades them by 18-36%. Downstream accuracy confirms the pattern: for example, after seven sequential adaptations on Qwen2.5, GAIN-FFN degrades BoolQ by only 0.8% while LoRA damages it by 14.9%. GAIN adds 46K-230K parameters per model and can be absorbed into the pretrained weights for zero inference cost.
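As a minimal sketch of the update $W_{\text{new}} = S\,W$ described above (not the authors' code; variable names and dimensions are assumptions), the diagonal modulation can be stored as a per-row gain vector on the output projection and absorbed into the pretrained weights after training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Pretrained output-projection weight (frozen during adaptation).
W = rng.standard_normal((d_model, d_model))

# GAIN's learnable diagonal modulation, stored as a vector
# (d_model parameters instead of d_model**2). Init at 1 = identity.
s = np.ones(d_model)
s[2] = 1.5   # pretend training raised the gain of one feature
s[5] = 0.7   # ...and lowered another

def forward_modulated(x, W, s):
    """Apply y = S W x: a per-row rescaling of the pretrained output."""
    return s * (W @ x)

# After training, S can be absorbed into W for zero inference cost.
W_merged = s[:, None] * W  # equals diag(s) @ W

x = rng.standard_normal(d_model)
assert np.allclose(forward_modulated(x, W, s), W_merged @ x)
```

Because only the gain vector is trained, the added parameter count scales with the model width times the number of modulated layers, consistent with the 46K-230K figures quoted in the abstract.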

Key Result

Proposition 1

For diagonal $S$ and weight matrix $W$, the modulated output satisfies $(SW)x = S(Wx)$ for every input $x$: each coordinate of the pretrained output is rescaled by a fixed gain. That is, multiplicative modulation can only produce outputs within the pretrained model's output subspace.
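A quick numerical illustration of the identity behind the proposition (illustrative only; the formal statement and proof are in the paper): for diagonal $S$, the modulated output $SWx$ is a coordinatewise rescaling of the pretrained output $Wx$, so an output coordinate the pretrained model zeros out stays zero after modulation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W = rng.standard_normal((d, d))
W[3, :] = 0.0                      # pretrained row 3 always outputs zero
s = rng.uniform(0.5, 1.5, size=d)  # positive diagonal gains

x = rng.standard_normal(d)
y_pre = W @ x                      # pretrained output W x
y_mod = np.diag(s) @ W @ x         # modulated output (S W) x

# Identity: (S W) x == S (W x) -- each coordinate is rescaled.
assert np.allclose(y_mod, s * y_pre)

# Consequence: a coordinate the pretrained model zeros stays zero,
# so modulation cannot escape the pretrained output pattern.
assert y_mod[3] == 0.0
```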

Figures (4)

  • Figure 1: Per-token loss change on four unrelated domains after medical adaptation (GPT-2 Large). GAIN ($\sigma \approx 0.12$) is $3\times$ narrower than LoRA ($\sigma \approx 0.30$).
  • Figure 2: LoRA's forgetting-adaptation tradeoff. Red points are LoRA with different learning rates and L2 regularization strengths ($\lambda$). GAIN-FFN (blue stars) operates in a different regime: the entire curve lies above zero.
  • Figure 3: Loss landscape interpolation on Mistral-7B. Left: in-domain PPL decreases for both. Right: GAIN's cross-domain loss is flat; LoRA's rises steeply.
  • Figure 4: Cross-domain loss when WikiText and Medical adaptations are combined. Left: GAIN is uniformly blue; any combination preserves LAMBADA. Right: LoRA shows red regions when adaptations are combined.

Theorems & Definitions (2)

  • Proposition 1: Output Space Preservation
  • Proof of Proposition 1