Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Zhicheng Wang; Liwen Xiao; Zhiguo Cao; Hao Lu

TL;DR

CACViT reframes Class-Agnostic Counting (CAC) as an extract-and-match task within a single Vision Transformer. By concatenating query and exemplar tokens, the plain ViT performs both feature extraction and matching via self-attention, augmented with aspect-ratio-aware scale embedding and magnitude embedding to preserve scale and order-of-magnitude information. The approach achieves state-of-the-art results on FSC147 and demonstrates robust cross-dataset generalization to CARPK, validated through extensive ablations. Overall, CACViT provides a concise, strong baseline that leverages ViT for CAC with minimal task-specific engineering and improved generalization potential.

Abstract

Class-agnostic counting (CAC) aims to count objects of interest in a query image given a few exemplars. This task is typically addressed by extracting the features of the query image and the exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT), where feature extraction and similarity matching are executed simultaneously within self-attention. We reveal the rationale of this simplification from a decoupled view of self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of scale and order-of-magnitude information due to resizing and normalization in a plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests that CACViT provides a concise and strong baseline for CAC. Code will be available.
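
To make the extract-and-match input concrete, the sketch below shows one plausible way to turn a query image and a few exemplars into a single token sequence for a plain ViT. It is an illustration only, not the authors' code: the helper names (patchify, build_token_sequence), the image sizes, the patch size, and the random linear projection standing in for a learned patch embedding are all assumptions, and positional, scale, and magnitude embeddings are left out.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)

def build_token_sequence(query_img, exemplars, patch, proj):
    """Embed query and exemplar patches with one shared projection, then concatenate.

    Query tokens are placed first (an assumed ordering), followed by the
    tokens of each exemplar in turn.
    """
    q_tokens = patchify(query_img, patch) @ proj
    e_tokens = [patchify(e, patch) @ proj for e in exemplars]
    tokens = np.concatenate([q_tokens] + e_tokens, axis=0)
    return tokens, q_tokens.shape[0]

rng = np.random.default_rng(0)
patch, dim = 16, 32                                       # hypothetical sizes
query_img = rng.random((64, 64, 3))                       # toy query image
exemplars = [rng.random((32, 32, 3)) for _ in range(3)]   # toy resized exemplars
proj = rng.standard_normal((patch * patch * 3, dim)) * 0.02  # stand-in patch embedding

tokens, n_query = build_token_sequence(query_img, exemplars, patch, proj)
print(tokens.shape, n_query)
```

With these toy sizes the query image contributes 16 tokens and each exemplar contributes 4, so the encoder would see one sequence of 28 tokens. Sharing the projection between query and exemplars mirrors the idea of a single feature extractor, in contrast to the unshared extractors of the extract-then-match paradigm.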

Paper Structure (12 sections, 3 equations, 8 figures, 5 tables)

Introduction
Related Work
Class-Agnostic Counting Vision Transformer
Overview of Approach
A Decoupled View of Self-Attention
Scale and Magnitude Priors
Experiments
Implementation Details
Comparison with State of the Art
Ablation Study
Cross-Dataset Generalization
Conclusions

Figures (8)

Figure 1: High-level comparison of prior art and our approach. (a) Previous ViT-based class-agnostic counting frameworks follow the extract-then-match paradigm, with unshared feature extractors (e.g., a ViT and a CNN) for the query image and the exemplars, and post-matching such as cross-attention after feature extraction; (b) Our ViT-based framework follows an extract-and-match paradigm using self-attention in a decoupled view, with an additional aspect-ratio-aware scale embedding (SE) and an order-of-magnitude embedding (ME) to compensate for the loss of scale information in ViT.
Figure 2: The framework of the CAC Vision Transformer (CACViT). A query image and exemplars with scale embedding are split into patches to form tokens. The flattened tokens are then concatenated and fed into the transformer encoder. Afterward, the output feature of the query image and the similarity metric from the attention map with magnitude embedding (ME) are concatenated for regression. Finally, a regression decoder predicts the density map. Note that the attention map is similar to the density map.
Figure 3: Decoupled view of self-attention in CACViT. The top-left $A_{query}$ can be regarded as self-attention of the query image. The bottom-left $A_{match}$ can be interpreted as cross-attention between query images and exemplars, despite being implemented in self-attention.
Figure 4: Visualizations of attention maps in a decoupled view for different layers. (a) $A_{query}$ and $A_{match}$ highlight the foreground and suppress the background. (b) In the shallow layers, $A_{class}$ favors foreground; but in the deep layers, $A_{class}$ highlights background.
Figure 5: Examples of images with different scale levels and aspect ratio levels. The first and second rows display the variations in scale and in aspect ratio in the FSC147 dataset, respectively.
...and 3 more figures
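
The decoupled view in the Figure 3 caption can be illustrated with a small numerical sketch. It assumes a single attention head, random weights, and that query-image tokens precede exemplar tokens in the concatenated sequence, so that the top-left block of the attention matrix covers query-to-query attention and the bottom-left block covers exemplar rows attending to query columns. The function names, the token counts, and the final averaging step that produces a similarity map are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(tokens, w_q, w_k):
    """Single-head attention weights over the concatenated token sequence."""
    q = tokens @ w_q
    k = tokens @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

def decouple(attn, n_query):
    """Slice the joint attention matrix into the two blocks named in Figure 3.

    Assumes the first n_query tokens belong to the query image.
    """
    a_query = attn[:n_query, :n_query]   # top-left: query image attends to itself
    a_match = attn[n_query:, :n_query]   # bottom-left: exemplar rows, query columns
    return a_query, a_match

rng = np.random.default_rng(0)
n_query, n_exemplar, dim = 16, 12, 32    # hypothetical: 4x4 query grid, 3 exemplars x 4 tokens
tokens = rng.standard_normal((n_query + n_exemplar, dim))
w_q = rng.standard_normal((dim, dim)) * 0.1
w_k = rng.standard_normal((dim, dim)) * 0.1

attn = joint_attention(tokens, w_q, w_k)
a_query, a_match = decouple(attn, n_query)

# Illustrative reduction (an assumption): average over exemplar tokens to get
# one similarity score per query patch, reshaped to the query patch grid.
sim_map = a_match.mean(axis=0).reshape(4, 4)
print(a_query.shape, a_match.shape, sim_map.shape)
```

Because each row of the joint attention matrix is normalized over all tokens, the blocks on their own do not sum to one per row; the matching block is a slice of ordinary self-attention rather than a separately normalized cross-attention, which is the sense in which matching is "implemented in self-attention".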
