A Simple-but-effective Baseline for Training-free Class-Agnostic Counting

Yuhao Lin, Haiming Xu, Lingqiao Liu, Javen Qinfeng Shi

TL;DR

The paper tackles Class-Agnostic Counting (CAC) in a training-free setting by leveraging the Segment Anything Model (SAM) and four key technologies to close the performance gap with training-based CAC. It introduces superpixel-guided prompts, semantic-rich feature representations, a multiscale segmentation strategy, and a transductive prototype updating scheme, each contributing to improved object recall and discrimination without additional training. Empirical results on FSC-147 and CARPK show substantial improvements over prior training-free methods and competitive performance relative to trained approaches, effectively narrowing the training-free versus training-based gap. The work delivers a strong, training-free baseline for CAC and offers practical insights for deploying SAM-based counting in diverse, data-scarce scenarios.

Abstract

Class-Agnostic Counting (CAC) seeks to accurately count objects in a given image with only a few reference examples. While previous methods achieving this relied on additional training, recent efforts have shown that it's possible to accomplish this without training by utilizing pre-existing foundation models, particularly the Segment Anything Model (SAM), for counting via instance-level segmentation. Although promising, current training-free methods still lag behind their training-based counterparts in terms of performance. In this research, we present a straightforward training-free solution that effectively bridges this performance gap, serving as a strong baseline. The primary contribution of our work lies in the discovery of four key technologies that can enhance performance. Specifically, we suggest employing a superpixel algorithm to generate more precise initial point prompts, utilizing an image encoder with richer semantic knowledge to replace the SAM encoder for representing candidate objects, and adopting a multiscale mechanism and a transductive prototype scheme to update the representation of reference examples. By combining these four technologies, our approach achieves significant improvements over existing training-free methods and delivers performance on par with training-based ones.
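
The matching-and-update idea in the abstract can be illustrated with a small sketch. The abstract does not state the update rule, so everything below is an assumption made for illustration: features are plain lists of floats, candidates are compared to the prototype by cosine similarity, the threshold value is arbitrary, and the prototype is refined by averaging the reference features with the currently accepted candidates. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def count_with_transductive_prototype(reference_feats, candidate_feats,
                                      threshold=0.8, rounds=1):
    """Count candidates that match a prototype refined from its own matches.

    ASSUMPTION (not taken from the paper): the prototype starts as the mean of
    the reference features; in each round it is recomputed as the mean of the
    references plus every candidate accepted so far in that round.
    """
    prototype = mean_vector(reference_feats)
    for _ in range(rounds):
        accepted = [f for f in candidate_feats
                    if cosine(prototype, f) >= threshold]
        prototype = mean_vector(reference_feats + accepted)
    # Final pass: count candidates that match the (possibly refined) prototype.
    count = sum(1 for f in candidate_feats if cosine(prototype, f) >= threshold)
    return count, prototype


if __name__ == "__main__":
    refs = [[1.0, 0.0]]                            # one reference exemplar
    cands = [[0.9, 0.3], [0.7, 0.6], [0.0, 1.0]]   # two objects, one background
    print(count_with_transductive_prototype(refs, cands, rounds=0)[0])
    print(count_with_transductive_prototype(refs, cands, rounds=1)[0])
```

In this toy input, the second candidate falls just below the threshold against the raw reference but is accepted once the prototype has absorbed the first match, which is the kind of effect a transductive update is meant to capture.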

Paper Structure (15 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Performance comparison. Our method notably narrows the gap in counting accuracy between training-free and training-based approaches. It even surpasses the top-performing training-based methods on the CARPK dataset, underscoring the potential of training-free approaches.
  • Figure 2: The top row illustrates how object-prior point prompts are created. The bottom row gives a pipeline overview of our proposed training-free object counting approach. Details of the transductive update module are provided in Eq. \ref{eq:tpu}. Reference objects are marked with yellow boxes in the input image (zoom in for clarity).
  • Figure 3: Effect of superpixels on the quality of the mask proposals generated by SAM. Left: an image with reference exemplars (hot-air balloons) in black boxes. Right: a bubble chart illustrating the trade-off between the recall rate of the object of interest and the time cost. Without increasing the number of point prompts, SAM with superpixels can significantly improve the recall rate of hot-air balloons in the output mask proposals, thus avoiding the computational burden that denser point grids would impose. Visualizations of the generated mask proposals are provided in the appendix.
  • Figure 4: Visualization of the similarity mappings of image features to reference object features (marked in black boxes) using SAM and DINOv2. It distinctly shows that SAM's similarity mapping erroneously highlights numerous areas unrelated to the target object, whereas DINOv2's mapping accurately encompasses the objects of interest. This demonstrates that DINOv2's features possess more semantically relevant knowledge.
  • Figure 5: Effect of our multi-scale mechanism in scenarios involving extremely tiny objects. Merely increasing the number of point prompts in SAM fails to yield accurate instance-level mask proposals. However, integrating SAM with our multi-scale mechanism (using 32×32 points for aerial photography of sheep in this demonstration) achieves better mask quality.
  • ...and 1 more figure
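
The superpixel-guided prompting described in the Figure 2 and Figure 3 captions can be sketched as follows. This is only an illustration under stated assumptions: the superpixel label map is taken as given (in practice it could come from an algorithm such as SLIC), and placing one point at each superpixel's centroid is an assumed choice, since the captions do not say how points are derived from superpixels. The function name `centroid_point_prompts` is hypothetical.

```python
from collections import defaultdict


def centroid_point_prompts(label_map):
    """Return {label: (x, y)}: one candidate point prompt per superpixel.

    `label_map` is a 2D list of integer superpixel labels (rows of pixels).
    ASSUMPTION: the prompt for each superpixel is the centroid of its pixels;
    the source does not specify how points are chosen.
    """
    # label -> [sum of x coordinates, sum of y coordinates, pixel count]
    sums = defaultdict(lambda: [0, 0, 0])
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            acc = sums[label]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    return {label: (sx / n, sy / n)
            for label, (sx, sy, n) in sorted(sums.items())}


if __name__ == "__main__":
    # Toy 3x4 label map with three superpixels.
    labels = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 2, 2],
    ]
    for label, point in centroid_point_prompts(labels).items():
        print(label, point)
```

Unlike a regular grid, points placed this way follow image structure, so the number of prompts tracks the number of superpixels rather than the grid density. One caveat of the centroid choice is that it can fall outside a strongly non-convex superpixel.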