Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko

TL;DR

The findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

Abstract

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
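To make the probing sites named in the abstract concrete, here is a minimal sketch of module-level linear probing on a toy, randomly initialized pre-norm transformer block written in NumPy. It is not the authors' implementation. The site names (`ln2`, `ffn_act`, `block_out`), the helper functions, mean pooling over tokens, and the ridge least-squares probe are all assumptions made for illustration. In particular, reading "normalized output of the multi-head self-attention module" as the output of the second layer norm is an interpretation, not something the abstract states.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Parameter-free layer norm over the feature axis (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_block(rng, dim, hidden):
    # Random weights stand in for a pretrained ViT (hypothetical toy setup).
    return {
        "w_qkv": rng.normal(0, dim**-0.5, (dim, 3 * dim)),
        "w_o": rng.normal(0, dim**-0.5, (dim, dim)),
        "w_fc1": rng.normal(0, dim**-0.5, (dim, hidden)),
        "w_fc2": rng.normal(0, hidden**-0.5, (hidden, dim)),
    }

def block_forward(x, p, num_heads):
    """Pre-norm block on x of shape (batch, tokens, dim); also returns probing sites."""
    b, t, d = x.shape
    hd = d // num_heads
    qkv = (layer_norm(x) @ p["w_qkv"]).reshape(b, t, 3, num_heads, hd)
    q, k, v = (qkv[:, :, i].transpose(0, 2, 1, 3) for i in range(3))
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd))
    mhsa = (attn @ v).transpose(0, 2, 1, 3).reshape(b, t, d) @ p["w_o"]
    x = x + mhsa                       # first residual connection
    ln2 = layer_norm(x)                # assumed "normalized MHSA output" site
    act = gelu(ln2 @ p["w_fc1"])       # activation inside the feedforward network
    out = x + act @ p["w_fc2"]         # standard block output
    return out, {"ln2": ln2, "ffn_act": act, "block_out": out}

def extract_features(x, blocks, num_heads):
    # One feature matrix per (layer index, site), mean-pooled over tokens (assumption).
    feats = {}
    for i, p in enumerate(blocks):
        x, acts = block_forward(x, p, num_heads)
        for name, a in acts.items():
            feats[(i, name)] = a.mean(axis=1)
    return feats

def fit_linear_probe(f, y, num_classes, l2=1e-3):
    # Ridge regression onto one-hot labels, a simple stand-in for a linear probe.
    f1 = np.hstack([f, np.ones((len(f), 1))])
    targets = np.eye(num_classes)[y]
    return np.linalg.solve(f1.T @ f1 + l2 * np.eye(f1.shape[1]), f1.T @ targets)

def probe_accuracy(w, f, y):
    f1 = np.hstack([f, np.ones((len(f), 1))])
    return float(((f1 @ w).argmax(axis=1) == y).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = [init_block(rng, dim=32, hidden=128) for _ in range(4)]
    x = rng.normal(size=(200, 8, 32))   # toy "patch tokens"
    y = rng.integers(0, 3, size=200)    # toy labels
    feats = extract_features(x, blocks, num_heads=4)
    train, val = slice(0, 150), slice(150, 200)
    for site, f in feats.items():
        w = fit_linear_probe(f[train], y[train], num_classes=3)
        print(site, round(probe_accuracy(w, f[val], y[val]), 3))
```

With a real pretrained ViT, the same loop would compare held-out probe accuracy across every (layer, site) pair and pick the best one instead of defaulting to the last block output. On the random data used here the accuracies carry no meaning; the sketch only shows where the features are read from.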

Paper Structure (19 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Layer by layer. Evolution of the linear probing performance across each layer of an $86$M ViT pretrained on ImageNet (the x-axis is the depth percentage of the layer). The solid line denotes the model only pretrained, and the dashed line denotes the model finetuned on the dataset at hand. From left to right, the shift between the pretraining and the downstream data increases. The stronger the shift, the worse the final layers.
  • Figure 2: Transformer block.
  • Figure 3: Layer by layer, module by module. Evolution of the linear probing performance of transformer modules across the layers of an $86$M ViT pretrained on ImageNet. From left to right, the shift between the pretraining and the downstream data increases.
  • Figure 4: ViT-Base Implementation.
  • Figure 5: Layer by layer. Evolution of the linear probing performance across the layers of an $86$M ViT pretrained on ImageNet. The solid line denotes the model only pretrained, and the dashed line denotes the model finetuned on the dataset at hand. From left to right, the shift between the pretraining and the downstream data increases. The stronger the shift, the worse the final layers.
  • ...and 1 more figure