FiGO: Fine-Grained Object Counting without Annotations
Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
TL;DR
FiGO tackles fine-grained object counting without manual annotations by learning a category-specific concept embedding that conditions a frozen counting model. It synthesizes training data with a diffusion model, builds coarse pseudo-annotations from attention maps, and uses positive and hard-negative supervision to specialize the counter at test time. The LOOKALIKES dataset provides a focused benchmark for distinguishing closely related subcategories in dense scenes, on which FiGO outperforms open-vocabulary segmentation baselines and prior class-agnostic counting (CAC) methods. The approach is computationally efficient: specializing to a category takes only a few minutes, and it delivers substantial throughput gains during inference, which makes annotation-free fine-grained counting practically viable.
Abstract
Class-agnostic counting (CAC) methods reduce annotation costs by letting users define what to count at test time through text or visual exemplars. However, current open-vocabulary approaches, while effective for broad categories, fail when fine-grained distinctions are needed, such as telling apart waterfowl species or pepper cultivars. We present FiGO, a new annotation-free method that adapts existing counting models to fine-grained categories using only the category name. Our approach uses a text-to-image diffusion model to create synthetic examples and a joint positive/hard-negative loss to learn a compact concept embedding. This embedding conditions a specialization module that converts the outputs of any frozen counter into accurate, fine-grained estimates. To evaluate fine-grained counting, we introduce LOOKALIKES, a dataset of 37 subcategories across 14 parent categories with many visually similar objects per image. Our method substantially outperforms strong open-vocabulary baselines, moving counting systems from "count all the peppers" to "count only the habaneros."
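To make the joint positive/hard-negative objective concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the specialization module emits a per-pixel gate in [0, 1] that multiplies the frozen counter's density map, that positives are supervised with a squared error against a pseudo-annotated density map, and that hard negatives are supervised by pushing the predicted count toward zero. The names `apply_gate`, `count`, `joint_loss`, and `neg_weight` are hypothetical and chosen only for illustration.

```python
from typing import List

# A 2-D map (density map or gate) stored as nested lists of floats.
Map = List[List[float]]


def apply_gate(density: Map, gate: Map) -> Map:
    """Element-wise product of a frozen counter's density map and a gate.

    Assumption: the gate holds values in [0, 1] that say how much of each
    pixel's density belongs to the target subcategory.
    """
    return [
        [d * g for d, g in zip(d_row, g_row)]
        for d_row, g_row in zip(density, gate)
    ]


def count(density: Map) -> float:
    """Predicted object count: the sum of the density map."""
    return sum(sum(row) for row in density)


def joint_loss(pos_pred: Map, pos_target: Map, neg_pred: Map,
               neg_weight: float = 1.0) -> float:
    """Hypothetical joint objective over one positive and one hard negative.

    Positive term: mean squared error between the specialized density map
    and a pseudo-annotated target map.
    Negative term: squared predicted count on a hard-negative image, which
    should contain no instances of the target subcategory.
    """
    n_pixels = sum(len(row) for row in pos_pred)
    pos_term = sum(
        (p - t) ** 2
        for p_row, t_row in zip(pos_pred, pos_target)
        for p, t in zip(p_row, t_row)
    ) / n_pixels
    neg_term = count(neg_pred) ** 2
    return pos_term + neg_weight * neg_term


# Toy usage: the frozen counter sees density in two pixels, and the gate
# keeps half of the first pixel and none of the second.
frozen_density = [[2.0, 4.0]]
gate = [[0.5, 0.0]]
specialized = apply_gate(frozen_density, gate)
loss = joint_loss(specialized, [[1.0, 0.0]], neg_pred=[[0.5, 0.5]])
```

In this sketch the negative term is what penalizes a gate that lets visually similar subcategories through, while the positive term alone would only reward agreement with the pseudo-annotations.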
