
CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

Giacomo Pacini, Lorenzo Bianchi, Luca Ciampi, Nicola Messina, Giuseppe Amato, Fabrizio Falchi

TL;DR

CountingDINO tackles class-agnostic counting without any labeled data by using self-supervised DINO-based backbones to extract object-aware features. It applies exemplar representations as depthwise convolutional kernels over the image features to obtain similarity maps, normalizes them into a density map whose integral over the exemplar regions equals the number of exemplars, and thresholds the result to suppress background. The method also increases spatial resolution through recursive quadrant partitioning, which improves localization of small objects. Empirically, this training-free approach achieves results competitive with the state of the art on FSC-147 and CARPK: it outperforms a purely unsupervised detector baseline and matches or surpasses several supervised CAC methods, highlighting the potential of self-supervised features for scalable open-world counting.
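
The recursive quadrant partitioning mentioned above can be illustrated with a minimal sketch. The summary does not give the recursion depth or how tile features are merged, so the function name `quadrants`, the `(y0, x0, y1, x1)` box convention, and the `depth` parameter are all assumptions made for illustration; the idea shown is only that each tile can be passed to the backbone separately, so small objects occupy more patches.

```python
def quadrants(box, depth):
    """Recursively split a (y0, x0, y1, x1) box into 4**depth tiles.

    Hypothetical helper: the box convention and depth parameter are
    illustrative assumptions, not taken from the paper.
    """
    if depth == 0:
        return [box]
    y0, x0, y1, x1 = box
    ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
    tiles = []
    # Visit the four quadrants in row-major order: TL, TR, BL, BR.
    for sub in ((y0, x0, ym, xm), (y0, xm, ym, x1),
                (ym, x0, y1, xm), (ym, xm, y1, x1)):
        tiles.extend(quadrants(sub, depth - 1))
    return tiles


tiles_1 = quadrants((0, 0, 8, 8), depth=1)
tiles_2 = quadrants((0, 0, 16, 16), depth=2)
```

Each returned tile would then be resized to the backbone's input size and encoded on its own; how the per-tile feature maps are stitched back together is left out here because the summary does not describe it.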

Abstract

Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we consistently outperform a baseline based on an SOTA unsupervised object detector under the same label- and training-free setting. Additionally, we achieve results that are competitive with, and in some cases surpass, those of training-free methods that rely on supervised backbones, non-training-free unsupervised methods, and several fully supervised SOTA approaches. This demonstrates that label- and training-free CAC can be both scalable and effective. Code: https://lorebianchi98.github.io/CountingDINO/.
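
The inference steps described above (exemplar prototypes used as convolutional kernels, normalization into a density map, background thresholding, integration) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation. It makes several assumptions: a plain box crop stands in for ROI-Align, a random-free toy array stands in for DINO features, the kernel response is summed over channels, normalization scales the map so that the exemplar boxes together integrate to the number of exemplars, and `threshold_ratio` is a hypothetical parameter.

```python
import numpy as np


def similarity_map(features, kernel):
    """Cross-correlate a (C, kh, kw) kernel with a (C, H, W) map, same output size."""
    _, h, w = features.shape
    _, kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    # Zero-pad so the output has the same spatial size as the input.
    padded = np.pad(features, ((0, 0), (top, kh - 1 - top), (left, kw - 1 - left)))
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + kh, j:j + kw] * kernel)
    return out


def count_objects(features, boxes, threshold_ratio=0.5):
    """Return (density_map, count) for exemplar boxes given as (y0, x0, y1, x1)."""
    sims = []
    for y0, x0, y1, x1 in boxes:
        kernel = features[:, y0:y1, x0:x1]  # assumption: crop instead of ROI-Align
        sims.append(np.maximum(similarity_map(features, kernel), 0.0))
    sim = np.mean(sims, axis=0)  # aggregate the per-exemplar similarity maps
    # Assumed normalization: the exemplar boxes together integrate to len(boxes).
    box_mass = sum(sim[y0:y1, x0:x1].sum() for y0, x0, y1, x1 in boxes)
    if box_mass <= 0:
        return np.zeros_like(sim), 0.0
    density = sim * len(boxes) / box_mass
    # Assumed thresholding rule: zero out responses far below the peak.
    density[sim < threshold_ratio * sim.max()] = 0.0
    return density, float(density.sum())


# Toy feature map with three identical single-cell "objects".
features = np.zeros((1, 12, 12))
for y, x in [(2, 2), (2, 8), (8, 5)]:
    features[0, y, x] = 1.0
boxes = [(1, 1, 4, 4), (1, 7, 4, 10)]  # two exemplars, one object each
density, count = count_objects(features, boxes)
print(count)  # 3.0 on this toy input
```

On real backbone features the similarity response is spread over several cells per object, which is why the normalization is defined on the exemplar boxes rather than on individual peaks.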

Paper Structure

This paper contains 28 sections, 2 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: CountingDINO is a training-free and unsupervised class-agnostic counting method. The instance running with DINOv2 outperforms a training-free unsupervised detection-based baseline (CutLER) on both MAE and RMSE (lower is better). Notably, it matches, and in some cases outperforms, previous non-training-free or supervised methods.
  • Figure 2: Overview of CountingDINO. Given an image $I$ and $N$ exemplar boxes, we extract features using the DINO-based visual backbone and apply each exemplar as a convolutional kernel over the image feature map to obtain similarity maps. These are aggregated, normalized into a density map using spatial priors, and thresholded before integration to produce the final count.
  • Figure 3: Qualitative results on FSC-147. Samples obtained with DINOv2 ViT L/14 Reg. as the backbone. Below the images, we report the density maps before and after background thresholding.
  • Figure 4: Qualitative results on FSC-147. The last samples were selected according to the highest counting error.
  • Figure 5: Qualitative results.
  • ...and 6 more figures
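
The MAE and RMSE metrics referred to in the Figure 1 caption are the usual counting errors computed over per-image predicted and ground-truth counts. A minimal sketch, with hypothetical function names and made-up example counts that are not results from the paper:

```python
import math


def mae(preds, targets):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def rmse(preds, targets):
    """Root mean squared error; penalizes large per-image errors more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))


# Illustrative counts only (not taken from the paper).
predicted = [10, 20, 30]
ground_truth = [12, 20, 26]
print(mae(predicted, ground_truth))   # 2.0
print(rmse(predicted, ground_truth))
```

Because RMSE squares each error before averaging, a few badly miscounted dense images raise it much more than they raise MAE, which is why both are reported.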