Learning to Count without Annotations

Lukas Knobel; Tengda Han; Yuki M. Asano

TL;DR

This work tackles reference-based counting without manual annotations by generating Self-Collages from unlabeled data and training a transformer-based counting model with pseudo-density supervision derived from segmentation. It leverages a frozen DINO-based ViT backbone for both image and exemplar encoding and a cross-attention-based interaction module to produce a density map $\hat{\mathbf{y}}$ conditioned on exemplars $\mathcal{S}$. Across FSC-147, CARPK, and MSO, UnCounTR outperforms simple baselines and, in several settings, matches supervised counting models, demonstrating strong generalization and robustness to domain shift. By enabling counting without labeled data and even permitting self-supervised semantic counting, this approach reduces annotation costs and opens avenues for scalable visual counting across diverse domains.

Abstract

While recent supervised methods for reference-based object counting continue to improve performance on benchmark datasets, they have to rely on small datasets due to the cost of manually annotating dozens of objects per image. We propose UnCounTR, a model that can learn this task without requiring any manual annotations. To this end, we construct "Self-Collages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to demonstrate, for the first time, that reference-based counting can be learned without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN and DETR, but also matches the performance of supervised counting models in some domains.
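To make the Self-Collage idea concrete, here is a toy sketch in plain Python. It is not the authors' composer: the real method pastes segmented objects from unlabeled images, whereas this sketch pastes square patches of integer ids onto an empty grid. The function name `make_self_collage`, its parameters, and the choice to place one unit of density at each patch centre are all assumptions made for illustration. The point it shows is that the pseudo-count is known by construction, so no manual annotation is needed.

```python
import random


def make_self_collage(height, width, num_objects, obj_size, seed=0):
    """Toy collage: paste square 'objects' and record a pseudo density map.

    Hypothetical helper for illustration only, not the paper's pipeline.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]        # stand-in background
    density = [[0.0] * width for _ in range(height)]    # pseudo ground truth
    boxes = []
    for obj_id in range(1, num_objects + 1):
        top = rng.randrange(0, height - obj_size + 1)
        left = rng.randrange(0, width - obj_size + 1)
        for r in range(top, top + obj_size):
            for c in range(left, left + obj_size):
                image[r][c] = obj_id  # later pastes may occlude earlier ones
        # Assumption: one unit of mass per object, placed at the patch centre,
        # so the density map sums to the number of pasted objects.
        density[top + obj_size // 2][left + obj_size // 2] += 1.0
        boxes.append((top, left, obj_size, obj_size))
    return image, density, boxes


image, density, boxes = make_self_collage(32, 32, num_objects=5, obj_size=6)
pseudo_count = sum(sum(row) for row in density)
```

In this sketch, any of the returned `boxes` could serve as an exemplar crop, and `pseudo_count` equals the number of pasted patches regardless of where they land.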

Paper Structure (58 sections, 6 equations, 12 figures, 11 tables, 1 algorithm)

Introduction
Related work
Method
Constructing Self-Collages
Construction strategy
Density map construction
UnCounTR
Experiments
Implementation Details
Comparison against baselines
Ablation Study
Benchmark comparison
Qualitative results and limitations
Improving upon UnCounTR
Self-supervised semantic counting
...and 43 more sections

Figures (12)

Figure 1: CounTR vs. our proposed UnCounTRv2. Our counting model, UnCounTRv2, is trained without any labels and manual counting annotations. It generalizes well from its up to 19 pseudo-counts encountered during training to the significantly higher numbers in the FSC-147 test set. This demonstrates that learning to count is possible without annotations.
Figure 2: Method overview. Our method leverages the strong coherence of deep clusters to provide pseudo-labelled images which are used to construct a self-supervised counting task. The composer utilises self-supervised segmentations for pasting a set of objects onto a background image and UnCounTR is trained to count these when provided with unsupervised exemplars.
Figure 3: Qualitative examples of UnCounTR's predictions. We show predictions on four images from the FSC-147 test set; the green boxes mark the exemplars. The predicted count is the sum of the density map rounded to the nearest integer.
Figure 4: Self-supervised semantic counting. In this setting, the model proposes the exemplars by itself and then performs reference-based counting.
Figure 5: Self-supervised semantic counting. To predict the number of objects without any prior, the model uses its DINO backbone to get initial exemplar candidates, which it subsequently refines and uses to predict density maps for the discovered object types.
...and 7 more figures
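
The Figure 3 caption states that the predicted count is the sum of the density map rounded to the nearest integer. A minimal sketch of that read-out follows. The function name `count_from_density` and the example values are hypothetical, and rounding exact halves upward is an assumption, since the caption does not say how ties are handled.

```python
import math


def count_from_density(density_map):
    """Sum a 2D density map (list of rows) and round to the nearest integer."""
    total = sum(sum(row) for row in density_map)
    # Assumption: ties (x.5) round up; the caption only says "nearest integer".
    return math.floor(total + 0.5)


# Made-up 2x2 density map whose mass sums to roughly 2.9.
example_map = [
    [0.2, 0.9],
    [1.1, 0.7],
]
predicted_count = count_from_density(example_map)
```

For this made-up map the total mass is about 2.9, so the read-out is 3 objects.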
