When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models

Ting-Yun Chang, Jesse Thomason, Robin Jia

TL;DR

This paper studies in-context learning by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components), and proposes component reweighting, which learns to linearly re-scale the component activations from a few labeled examples.

Abstract

This paper studies in-context learning by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.
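
To make the reweighting idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each component's contribution has already been projected to per-class logits, stored in a hypothetical array `comp_logits` of shape (examples, components, classes), and that the full model's logits are approximated by the plain sum over components. The function names, learning rate, and step count are illustrative assumptions; the sketch uses plain gradient descent on the cross-entropy loss and omits any regularization.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predict(comp_logits, weights):
    # Weighted sum of component logits, then argmax over classes.
    logits = np.einsum("c,nck->nk", weights, comp_logits)
    return logits.argmax(axis=-1)


def reweight_components(comp_logits, labels, lr=0.1, steps=200):
    """Learn one scalar weight per component from a few labeled examples.

    comp_logits: float array, shape (N, C, K) -- hypothetical precomputed
                 per-component class logits.
    labels:      int array, shape (N,).
    Returns an array of C weights.
    """
    n, c, k = comp_logits.shape
    weights = np.ones(c)           # all-ones reproduces the unweighted sum
    onehot = np.eye(k)[labels]     # (N, K)
    for _ in range(steps):
        logits = np.einsum("c,nck->nk", weights, comp_logits)
        probs = softmax(logits)
        # Gradient of mean cross-entropy with respect to each weight.
        grad = np.einsum("nk,nck->c", probs - onehot, comp_logits) / n
        weights -= lr * grad
    return weights
```

With all weights equal to 1, `predict` returns the prediction of the unweighted sum, so any gain from the learned weights can be compared directly against that baseline on held-out examples.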

Paper Structure (38 sections, 7 equations, 8 figures, 11 tables, 1 algorithm)

Figures (8)

  • Figure 1: Each dot represents a component (attention head or MLP) under 4-shot ICL on Llama-2-7B. The $x$-axis shows how often a component predicts "positive" on the test set. Top: We discover good-performing (blue), bad-performing (red), and label-biased (green) components. Bottom: Most components identified on SST2 show similar characteristics on Yelp-polarity.
  • Figure 2: Left: Transformer decomposition. The components (MLPs and attention heads) are filled in blue, and the blue lines show the flow of early decoding. Right: We can calculate the individual accuracy of every component after decomposition. Although two templates that differ only slightly yield very different accuracies ($0.39$ vs. $0.89$ on AGNews with Llama-2-7B), the accuracies of their internal components are highly correlated. The top components for Template 1 overlap with the ones for Template 2 and achieve $>0.7$ accuracy despite the poor full-model accuracy.
  • Figure 3: The ICL accuracy of the full model (green) fluctuates greatly during pretraining. However, good-performing components (T1) emerge in the early steps.
  • Figure 4: Transformer architecture in GPT2.
  • Figure 5: Each dot represents an example in the test set. The two most biased components still insist on predicting the same label on the entire test set regardless of the labels of the demonstrations.
  • ...and 3 more figures
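
The quantities plotted in Figures 1 and 2 (per-component accuracy, how often a component predicts a given label, and the correlation of component accuracies between two templates) can be sketched as follows. This is an illustrative sketch under the assumption that per-component class logits are already available in a hypothetical array `comp_logits` of shape (examples, components, classes); the function names are not from the paper.

```python
import numpy as np


def component_accuracies(comp_logits, labels):
    # Accuracy of each component when its own logits are decoded alone.
    preds = comp_logits.argmax(axis=-1)            # (N, C)
    return (preds == labels[:, None]).mean(axis=0)


def label_frequency(comp_logits, label_id):
    # Fraction of examples on which each component predicts `label_id`.
    # A value near 0 or 1 suggests a label-biased component.
    preds = comp_logits.argmax(axis=-1)
    return (preds == label_id).mean(axis=0)


def accuracy_correlation(acc_a, acc_b):
    # Pearson correlation between two vectors of component accuracies,
    # e.g. computed under two different prompt templates.
    return float(np.corrcoef(acc_a, acc_b)[0, 1])
```

A component whose `label_frequency` is 1.0 for one label predicts that label on every test example, which matches the label-biased behavior described in the Figure 1 and Figure 5 captions.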