AFreeCA: Annotation-Free Counting for All

Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

TL;DR

This work tackles the annotation burden in object counting by using latent diffusion models (LDMs) to generate synthetic data for an unsupervised counting pipeline. It introduces a sorting network trained on triplets created by adding or removing objects, a counting network anchored with synthetic counting data, and a density classifier-guided partitioning strategy (DCGP) that handles dense scenes by splitting them into high-resolution patches. The approach achieves state-of-the-art performance among unsupervised and zero-shot methods on crowd and vehicle counting benchmarks and demonstrates cross-category generalization to diverse object classes. By reducing the need for manual annotations and enabling counting across arbitrary categories, it offers a practical path toward scalable, annotation-free counting in diverse real-world environments.

Abstract

Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted these networks to counting objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these models can be used to generate counting datasets. LDMs struggle to create images with an exact number of objects based solely on text prompts, but they can offer a dependable sorting signal by adding and removing objects within an image. Leveraging this data, we first introduce an unsupervised sorting methodology to learn object-related features, which are subsequently refined and anchored for counting using counting data generated by LDMs. Further, we present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted. Consequently, we can generate counting data for any type of object and count them in an unsupervised manner. Our approach outperforms other unsupervised and few-shot alternatives and is not restricted to specific object classes for which counting data is available. Code to be released upon acceptance.
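
The sorting signal described in the abstract can be illustrated with a pairwise margin ranking loss over a triplet of images: one with objects removed, the reference image, and one with objects added. This is a minimal sketch in plain Python, not the authors' implementation. The function names, the margin value, and the choice of a hinge-style loss are assumptions made for illustration, since the abstract does not specify the loss.

```python
from typing import Sequence


def pairwise_ranking_loss(score_fewer: float, score_more: float, margin: float = 1.0) -> float:
    """Hinge penalty that is zero once score_more exceeds score_fewer by `margin`."""
    return max(0.0, margin - (score_more - score_fewer))


def triplet_sorting_loss(scores: Sequence[float], margin: float = 1.0) -> float:
    """Sum the ranking loss over consecutive pairs of a triplet.

    `scores` holds hypothetical network outputs ordered as
    (objects removed, reference, objects added), so each score
    should be larger than the one before it.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three scores")
    return sum(
        pairwise_ranking_loss(scores[i], scores[i + 1], margin)
        for i in range(2)
    )


# A correctly ordered triplet incurs no loss; a mis-ordered one is penalized.
well_sorted = triplet_sorting_loss([0.5, 2.0, 4.0])  # 0.0
mis_sorted = triplet_sorting_loss([2.0, 0.5, 4.0])   # 2.5
```

Note that such a loss only constrains the ordering of the scores, not their scale, which is consistent with the abstract's separate step of anchoring the features for counting with LDM-generated counting data.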

Paper Structure (41 sections, 11 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: We propose a method which exploits synthetic counting data generated by Stable Diffusion. With this, we establish an annotator-free method that produces accurate count maps without location-based supervision for a wide range of object categories.
  • Figure 2: Left: when given a prompt count of 20, Stable Diffusion outputs images with a similar but often incorrect object count. Right: as the prompt count increases, the relative error between the true underlying count and the prompt count increases.
  • Figure 3: Workflow. Our framework uses simple prompts to create synthetic data for training a sorting model, a density classifier, and a count anchoring network. These elements are combined into a model which can accurately count diverse object categories even within dense images by subdividing them into smaller, more manageable areas.
  • Figure 4: Sorting Features. We calculate the channel-wise mean of the features produced by the sorting network to demonstrate where the network is active. The network appears to focus on the object of interest across a wide range of crowd densities.
  • Figure 5: Methodology. Our strategy involves three distinct steps supported by a synthetic training signal extracted from Stable Diffusion. The pre-training step sorts synthetic and real images to learn high-quality object quantity features from the source distribution. The fine-tuning step utilizes the pre-trained features and synthetic data to train a counting head and a density classification head. Finally, the density classifier guides inference by partitioning dense images so that there are fewer objects per image.
  • ...and 10 more figures
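
The density classifier-guided partitioning mentioned in the captions of Figures 3 and 5 can be sketched as a recursive split: a region flagged as dense is divided into quadrants, and each quadrant is counted separately. The code below is a toy illustration under stated assumptions, not the authors' procedure. The quadrant split, the `max_depth` limit, the grid-of-numbers image, and the stand-in `is_dense` and `count_patch` callables are all hypothetical; the section does not give the actual partitioning rule or thresholds.

```python
from typing import Callable, List

Image = List[List[float]]  # toy stand-in: each cell holds the number of objects it contains


def split_into_quadrants(image: Image) -> List[Image]:
    """Split a 2D grid into top-left, top-right, bottom-left, bottom-right quadrants."""
    mid_row, mid_col = len(image) // 2, len(image[0]) // 2
    top, bottom = image[:mid_row], image[mid_row:]
    return [
        [row[:mid_col] for row in top],
        [row[mid_col:] for row in top],
        [row[:mid_col] for row in bottom],
        [row[mid_col:] for row in bottom],
    ]


def partitioned_count(
    image: Image,
    is_dense: Callable[[Image], bool],
    count_patch: Callable[[Image], float],
    max_depth: int = 3,
) -> float:
    """Count directly if the region is sparse; otherwise split it and sum the parts."""
    too_small = len(image) < 2 or len(image[0]) < 2
    if max_depth == 0 or too_small or not is_dense(image):
        return count_patch(image)
    return sum(
        partitioned_count(quadrant, is_dense, count_patch, max_depth - 1)
        for quadrant in split_into_quadrants(image)
    )


def total(image: Image) -> float:
    return sum(sum(row) for row in image)


# Hypothetical stand-ins: the counter saturates at 5 objects per patch,
# and the classifier flags any patch holding more than 5 objects as dense.
toy_counter = lambda patch: min(total(patch), 5)
toy_is_dense = lambda patch: total(patch) > 5

toy_image = [
    [3, 0, 0, 1],
    [0, 2, 0, 0],
    [0, 0, 4, 0],
    [1, 0, 0, 1],
]

direct_estimate = toy_counter(toy_image)  # 5: the saturating counter undercounts
partitioned_estimate = partitioned_count(toy_image, toy_is_dense, toy_counter)  # 12
```

In this toy setup the image holds 12 objects, the direct estimate saturates at 5, and the partitioned estimate recovers 12 because every quadrant falls within the range the stand-in counter handles reliably.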