ResiDual Transformer Alignment with Spectral Decomposition

Lorenzo Basile, Valentino Maiorca, Luca Bortolussi, Emanuele Rodolà, Francesco Locatello

TL;DR

This work investigates why transformer heads exhibit task-specific specialization in residual streams, with a focus on vision transformers and vision-language models. By analyzing the spectral geometry of residual units, the authors show that head representations are low-dimensional and that a subset of principal components encodes meaningful, task-aligned semantics. They introduce ResiDual, a spectral anisotropic reweighting of residual units, enabling efficient, interpretable alignment with textual concepts and achieving near fine-tuning performance across 7 pre-trained CLIP-like models and 10 datasets. The results demonstrate a practical, parameter-efficient route to improve zero-shot classification and cross-modal alignment, while offering an interpretable view of how specialized units contribute to multimodal understanding.

Abstract

When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performance on different data distributions while modelling an extremely interpretable and parameter-efficient transformation, as we extensively show on 70 pre-trained network-dataset combinations (7 models, 10 datasets).
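
The idea of reweighting each unit's principal components can be illustrated with a minimal NumPy sketch. This is one plausible reading of "spectral alignment of the residual stream", not the authors' implementation: the function names (`fit_pca_basis`, `spectral_reweight`), the synthetic head outputs, and the hand-picked weights are all assumptions made for illustration. In practice the weights would presumably be learned per task, and the summary above does not specify how the per-head mean is handled.

```python
import numpy as np

def fit_pca_basis(X, k):
    """Return the mean and the top-k principal directions (as rows) of X, shape (n, d)."""
    mu = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def spectral_reweight(X, mu, V, weights):
    """Project centered X onto basis V, scale each coordinate, and map back.

    The mean is not added back here; that is a simplification of this sketch.
    """
    coords = (X - mu) @ V.T          # (n, k) coordinates in the PCA basis
    return (coords * weights) @ V    # (n, d) reweighted contribution

# Synthetic stand-ins for the outputs of three residual units (hypothetical data).
rng = np.random.default_rng(0)
n, d, k, num_heads = 200, 16, 4, 3
heads = [rng.normal(size=(n, d)) for _ in range(num_heads)]
bases = [fit_pca_basis(X, k) for X in heads]

# Hypothetical anisotropic weights: keep component 0, damp component 2, drop the rest.
weights = [np.array([1.0, 0.0, 0.5, 0.0]) for _ in range(num_heads)]

# The reweighted residual output is the sum of the reweighted unit contributions.
output = sum(
    spectral_reweight(X, mu, V, w)
    for X, (mu, V), w in zip(heads, bases, weights)
)
```

A weight of zero removes a component's contribution entirely, while a weight of one leaves it unchanged, so the transformation has only `k` parameters per unit and each parameter maps to one principal component.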

Paper Structure

This paper contains 31 sections, 19 equations, 21 figures, 9 tables, 2 algorithms.

Figures (21)

  • Figure 1: From the transformer's residual stream, direct contributions from individual heads across the network can be analyzed. In a multimodal, zero-shot classification setting (e.g., in CLIP), task boundaries are defined by text prompts that may vary in their conceptual granularity. When certain heads are specialized in particular features (e.g., shape, pattern, color), they may more accurately apply these boundaries than the model’s original output. In this example, only the fine-grained task (brown) effectively separates the samples at the output level.
  • Figure 2: Heads in early layers show low-dimensional, linear structures, as suggested by similar intrinsic dimension estimates from PCA (L) and TwoNN (N). Moving toward the output layer, the nonlinear dimensionality peaks and then decreases, while PCA’s linear estimate continues to rise, indicating increasing nonlinearity in head manifolds ($\text{Ratio} = \frac{L}{N}$). The first principal component (EVR$_1$) explains around 50% of the variance in early layers, dropping to around 10% in later layers.
  • Figure 3: Comparison between TextSpan and Orthogonal Matching Pursuit on the first principal component (OMP$_1$), applied to the heads of OpenCLIP-L. Left: agreement score between the descriptions returned by the two methods. Right: qualitative comparison of selected descriptions for 4 heads, one per layer, at different agreement levels. A similar analysis for the second principal component is presented in the Appendix (\ref{sup:fig:ts_omp}).
  • Figure 4: (a) Attention head similarity across layers of OpenCLIP-L, computed between ImageNet head representations and those obtained on other datasets. (b) Descriptions picked by SOMP for three specialized heads that emerge from the analysis of panel (a).
  • Figure 5: Performance comparison between text-image alignment methods on zero-shot classification tasks.
  • ...and 16 more figures
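
The quantities named in the Figure 2 caption (a linear dimension estimate from PCA, a nonlinear estimate from TwoNN, their ratio, and EVR$_1$) can be sketched roughly as follows. Several choices here are assumptions rather than details from the paper: the 99% explained-variance threshold for the PCA estimate, the maximum-likelihood form of the TwoNN estimator, and the synthetic data (a 3-dimensional Gaussian embedded isometrically in 10 dimensions), which stands in for real head representations.

```python
import numpy as np

def pca_stats(X, var_threshold=0.99):
    """Linear dimension (components needed to reach var_threshold) and EVR of the first PC."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    evr = s**2 / np.sum(s**2)                     # explained variance ratios
    linear_dim = int(np.searchsorted(np.cumsum(evr), var_threshold) + 1)
    return linear_dim, float(evr[0])

def twonn_dimension(X):
    """TwoNN intrinsic dimension, maximum-likelihood variant: N / sum(log(r2 / r1))."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (N, N) pairwise distances
    np.fill_diagonal(dist, np.inf)                # exclude self-distances
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]         # first and second neighbor distances
    return len(X) / np.sum(np.log(r2 / r1))

# Synthetic data: 3-dimensional Gaussian, embedded in 10 dimensions by an orthonormal map.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
Q, _ = np.linalg.qr(rng.normal(size=(10, 3)))     # Q has orthonormal columns, shape (10, 3)
X = Z @ Q.T

L_dim, evr1 = pca_stats(X)
N_dim = twonn_dimension(X)
ratio = L_dim / N_dim                             # the L/N ratio from the caption
print(L_dim, round(N_dim, 2), round(ratio, 2), round(evr1, 2))
```

On a linearly embedded manifold such as this one the two estimates should roughly agree, so the ratio stays near 1; a ratio well above 1 would suggest a curved manifold that PCA needs extra directions to cover.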