FiGO: Fine-Grained Object Counting without Annotations

Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

TL;DR

FiGO tackles fine-grained object counting without manual annotations by learning a category-specific concept embedding that conditions a specialization module applied to the output of a frozen counting model. It synthesizes training data with a diffusion model, builds coarse pseudo-annotations from attention maps, and uses positive and hard-negative supervision to specialize the counter at test time. The LOOKALIKES dataset provides a focused benchmark for distinguishing closely related subcategories in dense scenes, and on it FiGO outperforms open-vocabulary segmentation baselines and prior CAC methods. The approach is computationally efficient: specialization takes only a few minutes per category and inference throughput improves substantially, making annotation-free fine-grained counting practically viable.

Abstract

Class-agnostic counting (CAC) methods reduce annotation costs by letting users define what to count at test time through text or visual exemplars. Current open-vocabulary approaches work well for broad categories but fail when fine-grained category distinctions are needed, such as telling apart waterfowl species or pepper cultivars. We present FiGO, a new annotation-free method that adapts existing counting models to fine-grained categories using only the category name. Our approach uses a text-to-image diffusion model to create synthetic examples and a joint positive/hard-negative loss to learn a compact concept embedding. This embedding conditions a specialization module that converts outputs from any frozen counter into accurate, fine-grained estimates. To evaluate fine-grained counting, we introduce LOOKALIKES, a dataset of 37 subcategories across 14 parent categories with many visually similar objects per image. Our method substantially outperforms strong open-vocabulary baselines, moving counting systems from "count all the peppers" to "count only the habaneros."
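
To make the joint positive/hard-negative loss more concrete, here is a minimal sketch of one plausible form. It is an assumption for illustration, not the paper's formulation: binary cross-entropy pulls the predicted map toward the pseudo-annotation on a positive image and toward an all-zero map on a hard-negative image. The names `bce`, `joint_loss`, and `neg_weight` are hypothetical.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally sized 2-D maps."""
    total, n = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            n += 1
    return total / n


def joint_loss(pos_pred, pos_mask, neg_pred, neg_weight=1.0):
    """Assumed joint loss: match the pseudo-mask on the positive image,
    and suppress all activation on the hard-negative image."""
    zeros = [[0.0] * len(row) for row in neg_pred]
    return bce(pos_pred, pos_mask) + neg_weight * bce(neg_pred, zeros)


# Toy 1x2 maps: same positive prediction, different behaviour on the negative.
quiet = joint_loss([[0.9, 0.1]], [[1.0, 0.0]], [[0.1, 0.1]])
firing = joint_loss([[0.9, 0.1]], [[1.0, 0.0]], [[0.9, 0.9]])
print(quiet < firing)  # activating on the look-alike category is penalized
```

In this sketch the negative term is what discourages the embedding from responding to visually similar subcategories; the weighting between the two terms is a free choice here.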

Paper Structure

This paper contains 36 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overcounting is a common failure mode for open-world counting methods when presented with visually similar objects.
  • Figure 2: Given only a text prompt, our method adapts counting models to novel fine-grained categories at test time.
  • Figure 3: DAVE (Pelhan et al., CVPR 2024) uses a filtering strategy to reduce false positives but fails on fine-grained categories, leading to overcounting. Boxes represent visual exemplars selected from each image.
  • Figure 4: Sample images from LOOKALIKES, our fine-grained counting dataset. The images are crowded and diverse, and each scene contains multiple distinct object subcategories.
  • Figure 5: Overview of our FiGO specialization pipeline for fine-grained counting. Given a target subcategory (e.g., Canada Goose), (1) Synthesize: A diffusion model generates positive and negative examples, and category-relevant attention maps are extracted by averaging attention across layers and heads to produce dense pseudo-annotations. (2) Tune: These positive and negative synthetic pairs supervise a learnable concept embedding inside a frozen CLIPSeg model, encouraging activation primarily on the target subcategory. (3) Specialize: At inference, the tuned embedding refines the output of a frozen class-agnostic counter, yielding an accurate category-specific count.
  • ...and 2 more figures
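
The Synthesize and Specialize steps in the Figure 5 caption can be sketched on toy data. This is a simplified illustration under stated assumptions, not the authors' implementation: attention maps are plain nested lists, and the specialization step is approximated by keeping a class-agnostic density map only where a normalized relevance map passes a threshold. The function names and the `threshold` parameter are hypothetical.

```python
def average_attention(maps):
    """Average equally sized 2-D attention maps (one per layer/head)."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[y][x] for m in maps) / len(maps) for x in range(width)]
        for y in range(height)
    ]


def min_max_normalize(attn):
    """Rescale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in attn for v in row]
    lo, span = min(flat), max(flat) - min(flat)
    if span == 0:
        return [[0.0] * len(row) for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def specialized_count(density, relevance, threshold=0.5):
    """Assumed specialization: sum density only where relevance is high."""
    return sum(
        d
        for d_row, r_row in zip(density, relevance)
        for d, r in zip(d_row, r_row)
        if r >= threshold
    )


# Two toy 2x2 attention maps standing in for different layers/heads.
maps = [[[0.0, 2.0], [0.0, 0.0]], [[0.0, 4.0], [2.0, 0.0]]]
relevance = min_max_normalize(average_attention(maps))

# Toy class-agnostic density map; its full sum would count every object.
density = [[1.0, 2.0], [3.0, 4.0]]
print(specialized_count(density, relevance))  # counts only the relevant cell
```

Hard thresholding is only one way to combine the two maps; a soft elementwise product of density and relevance would be an equally reasonable reading of "refines the output of a frozen class-agnostic counter".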