
Vision Transformer Finetuning Benefits from Non-Smooth Components

Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

TL;DR

This study reframes finetuning of vision transformers through a plasticity lens, defining the average rate of input-output change $\mathscr{P}(f)$ to quantify component adaptability. The authors derive upper bounds and a theoretical ranking that places the multi-head attention module as the most plastic, followed by the first and second feedforward layers, with LayerNorms the least plastic. Empirical results on ViT-Base and ViT-Huge across 11 classification benchmarks show attention and early FFN layers yield the best and most robust finetuning performance, validating the theory and guiding selective finetuning strategies. The findings offer a novel perspective on the role of smoothness in transformer adaptation and provide practical guidance for parameter-efficient finetuning and future adaptation methods.

Abstract

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
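As a rough illustration of "plasticity as an average rate of change", the sketch below averages the ratio $\|f(x)-f(y)\|_\mathrm{F}/\|x-y\|_\mathrm{F}$ (the rate of change named in the Figure 3 caption) over pairs of inputs. It is not the authors' implementation: averaging over all distinct pairs of a small sample, the helper names (`rate_of_change`, `empirical_plasticity`, `scale_by`, `random_tokens`), and the toy scaling maps standing in for real ViT components are all assumptions made for this example.

```python
import math
import random


def frobenius_norm(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))


def subtract(a, b):
    """Entrywise difference of two matrices of the same shape."""
    return [[u - v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def rate_of_change(f, x, y):
    """||f(x) - f(y)||_F / ||x - y||_F for two distinct inputs x and y."""
    return frobenius_norm(subtract(f(x), f(y))) / frobenius_norm(subtract(x, y))


def empirical_plasticity(f, inputs):
    """Average rate of change over all distinct pairs (an assumed estimator)."""
    pairs = [(x, y) for i, x in enumerate(inputs) for y in inputs[i + 1:]]
    return sum(rate_of_change(f, x, y) for x, y in pairs) / len(pairs)


def scale_by(c):
    """Hypothetical toy component: multiplies every entry by c."""
    return lambda x: [[c * v for v in row] for row in x]


def random_tokens(rng, n_tokens=4, dim=8):
    """Random (n_tokens x dim) matrix standing in for a token sequence."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


rng = random.Random(0)
inputs = [random_tokens(rng) for _ in range(6)]

rigid = empirical_plasticity(scale_by(0.5), inputs)    # contracts distances
plastic = empirical_plasticity(scale_by(3.0), inputs)  # expands distances
print(f"rigid: {rigid:.3f}, plastic: {plastic:.3f}")
```

For a pure scaling map the ratio equals the scale factor for every pair, so the two toy components have plasticity 0.5 and 3. Under this reading, a value below 1 corresponds to a smooth, rigid component and a value above 1 to a non-smooth, plastic one. Replacing `scale_by` with the forward function of an actual module would give the corresponding empirical estimate under the same assumptions.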

Paper Structure (60 sections, 41 equations, 51 figures, 6 tables)

Figures (51)

  • Figure 1: Non-smooth components facilitate finetuning. We illustrate the benefits of high plasticity when finetuning ViT-Base on Cifar10 (values normalized to $[0,1]$). Smooth modules like LayerNorm (top left) have low and steady rates of change, resulting in low plasticity (see Definition \ref{def:measure}). This constrains the gradient norms during optimization, leading to a slow descent on the loss landscape (bottom left). In contrast, the rates of change of non-smooth components, such as multi-head attention (top right), are large and vary widely, resulting in high plasticity and gradients of high magnitude. This allows exploration of the loss landscape and a faster descent towards (local) minima (bottom right).
  • Figure 2: Overview of our contributions. We conduct a comprehensive analysis of vision transformer components (left) through the perspective of their plasticity (Definition \ref{def:measure}). Our theoretical analysis allows us to rank modules in terms of their plasticity (Section \ref{sec:theory}). Experiments on large-scale ViTs support our theoretical insights (Section \ref{sec:analysis}), as shown by the distribution of plasticity over all benchmarks (middle). Through large-scale finetuning runs on an $86$M-parameter ViT (Section \ref{sec:finetuning}), we demonstrate the real-world benefits of plasticity. As showcased by the average relative gain, i.e., the improvement over the linear probing accuracy, on a diverse set of $11$ classification benchmarks (right), a higher plasticity yields greater finetuning benefits.
  • Figure 3: Plasticity analysis on Sketch. The distribution of rates of change $\|f(x)-f(y)\|_\mathrm{F}/\|x-y\|_\mathrm{F}$ on ViT-Base (left) follows the theoretical ranking of Section \ref{sec:theory}. We observe along transformer blocks of ViT-Base (middle) that the attention module has the highest plasticity $\mathscr{P}(f)$, followed by the first and second linear layers of the feedforward. The LayerNorms are the most rigid, with a plasticity below $1$. The same pattern is obtained on ViT-Huge (right), where the higher attention plasticity further validates our theory (see Proposition \ref{prop:mha_bounds}) since the sequence length $n$ is larger than with ViT-Base.
  • Figure 4: Benefits of plasticity on Sketch. Transformer components are ordered by decreasing plasticity. The performance across learning rates and seeds (left) is better and more stable for plastic components. This can be understood from the evolution of the gradient norms (middle) and the validation loss (right) throughout training: the higher the plasticity, the larger the gradient norms and the better the generalization.
  • Figure 5: ViT-Base Implementation.
  • ...and 46 more figures

Theorems & Definitions (6)

  • Remark 5.1
  • proof
  • proof
  • proof
  • proof
  • proof