Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias

TL;DR

The paper tackles the misalignment between linear probing and pre-trained representations that distribute information across patch tokens. It introduces Efficient Probing (EP), a lightweight multi-query cross-attention mechanism that eliminates redundant projections, achieving higher accuracy with far fewer trainable parameters than prior attentive probing methods. In systematic benchmarks across diverse pre-training paradigms and datasets, EP consistently surpasses linear probing and prior attentive approaches, while offering strong low-shot performance and robust layer-wise gains. The authors also show that EP yields diverse, complementary attention maps, linking localization quality to predictive power and suggesting new directions for using probing beyond evaluation, including interpretability and robustness. Code is released at the provided GitHub repository.

Abstract

As fine-tuning becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol. Yet, the standard linear probing fails to adequately reflect the potential of models whose pre-training optimizes representations of patch tokens rather than an explicit global representation. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on this, we propose efficient probing (EP), a simple yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Despite its simplicity, EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, and delivers strong low-shot and layer-wise gains. Beyond evaluation, our analysis uncovers emerging properties of EP, such as complementary attention maps, which open new directions for leveraging probing beyond protocol design. Code available at https://github.com/billpsomas/efficient-probing.

Paper Structure

This paper contains 19 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison of multi-head cross-attention (MHCA, left) vs. our transformation-free multi-query cross-attention (ep, right). MHCA uses an input vector ${\bm{u}}$ projected into query space and interacts with key features $K$ in (two) separate subspaces, each corresponding to an attention predictor. Attention predictor outputs ${\bm{a}}_j$ are used to aggregate value features $V$ into sub-vectors ${\bm{y}}_j$, forming the final output ${\bm{y}}$. In contrast, ep employs (two) learnable queries ${\bm{q}}_j$, one per attention predictor, to compute attention with input features directly in the full representation space. Attention predictor outputs ${\bm{a}}_j$ are used as in MHCA to perform the aggregation.
  • Figure 2: Top-1 classification accuracy vs. number of parameters for various self-supervised pre-training methods across different datasets. We evaluate both dedicated probing mechanisms (e.g., V-JEPA) and repurposed attentive pooling methods (e.g., CLIP). ep variants are marked with different colors for different output dimensionalities $D_o$. $\mathrm{EP}_M$: efficient probing with $M$ learnable queries. [CLS]: linear probing using the classification token; gap: global average pooling over patch tokens; mhca: multi-head cross-attention; ViT: default transformer block.
  • Figure 3: Top-1 classification accuracy vs. GFLOPs for MAE ViT-B with different probing methods on ImageNet-1K.
  • Figure 4: Classification accuracy vs. attention quality on ImageNet-1K. Each point corresponds to an attention predictor (head or query). $\Delta$ accuracy measures the drop when replacing an attention predictor’s distribution with uniform. Plots show relations to localization quality (1st, 3rd) and entropy (2nd, 4th). Left: different attentive probing methods; Right: varying $D_o$ for ep.
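
The mechanism described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). Each of the $M$ learnable queries ${\bm{q}}_j$ is compared with the patch features directly in the full $D$-dimensional space, and the resulting attention ${\bm{a}}_j$ aggregates one sub-vector ${\bm{y}}_j$ of the value features. Several details are assumptions made for illustration: the function name `ep_pool`, the $1/\sqrt{D}$ scaling of the logits, the value projection `W_v` of shape $(D, D_o)$, and the even split of $D_o$ across queries. The optional `uniform_idx` argument mimics the ablation in Figure 4, where one attention predictor's distribution is replaced with a uniform one.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ep_pool(X, Q, W_v, uniform_idx=None):
    """Multi-query cross-attention pooling (illustrative sketch).

    X:   (N, D)   patch features, used directly as keys (no key projection)
    Q:   (M, D)   learnable queries, one per attention predictor
    W_v: (D, D_o) value projection (an assumption of this sketch)
    uniform_idx: if set, that predictor's attention is replaced by uniform
    """
    N, D = X.shape
    M = Q.shape[0]
    D_o = W_v.shape[1]
    assert D_o % M == 0, "D_o must be divisible by the number of queries"

    # Attention of every query over the N patches; each column sums to 1.
    A = softmax(X @ Q.T / np.sqrt(D), axis=0)      # (N, M)
    if uniform_idx is not None:
        A[:, uniform_idx] = 1.0 / N

    # Split value features into M sub-spaces of size D_o / M.
    V = (X @ W_v).reshape(N, M, D_o // M)          # (N, M, d)

    # Query j aggregates only its own sub-space: y_j = sum_n a_j[n] * V[n, j].
    Y = np.einsum("nm,nmd->md", A, V)              # (M, d)
    return Y.reshape(D_o), A

# Usage with random stand-ins for features and parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(196, 32))    # e.g. 14 x 14 patch tokens
Q = rng.normal(size=(4, 32))      # M = 4 queries
W_v = rng.normal(size=(32, 16))   # D_o = 16
y, A = ep_pool(X, Q, W_v)
print(y.shape, A.shape)
```

In a probing setup the pooled vector `y` would feed a linear classifier, with `Q`, `W_v`, and the classifier trained while the backbone stays frozen. Because the queries act on the input features without separate query and key projections, the parameters of the attention part in this sketch are the `M * D` entries of `Q`.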