
Point, Segment and Count: A Generalized Framework for Object Counting

Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan

TL;DR

A generalized, detection-based framework for both few-shot and zero-shot object counting is proposed. It uses CLIP image/text embeddings as a generalized object classifier, with hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals.

Abstract

Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, a.k.a. few-shot and zero-shot counting. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the advantages of two foundation models without compromising their zero-shot capability: (i) SAM to segment all possible objects as mask proposals, and (ii) CLIP to classify proposals to obtain accurate object counts. However, this strategy faces two obstacles: efficiency overhead, and small, crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization that provides accurate but minimal point prompts for SAM, which not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves state-of-the-art performance in both few-shot and zero-shot object counting and detection. Code: https://github.com/Hzzone/PseCo
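
As a rough illustration of the final "count" step, the sketch below scores each mask proposal by cosine similarity to a class embedding and counts the proposals above a threshold. This is not the PseCo implementation: the function name `count_by_similarity`, the fixed `threshold`, and the toy 2-D vectors are assumptions made for illustration. In the framework described above, the embeddings would come from CLIP (presumably text embeddings for class names and image embeddings for example boxes).

```python
import numpy as np

def count_by_similarity(proposal_embeds, class_embed, threshold=0.5):
    """Count proposals whose cosine similarity to class_embed exceeds threshold.

    Hypothetical helper: proposal_embeds is (N, D), class_embed is (D,).
    """
    p = np.asarray(proposal_embeds, dtype=float)
    c = np.asarray(class_embed, dtype=float)
    # L2-normalize so the dot product equals cosine similarity.
    p = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), 1e-12)
    c = c / np.maximum(np.linalg.norm(c), 1e-12)
    scores = p @ c
    # A proposal counts as an object of the target class if it scores above threshold.
    return int((scores > threshold).sum()), scores

# Toy example: two proposals resemble the class, one does not.
n, scores = count_by_similarity(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [1.0, 0.0], threshold=0.5
)
print(n)  # 2
```

The threshold here is a free parameter of the sketch; how proposals are actually scored and filtered is specified in the paper and the linked repository.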

Paper Structure (27 sections, 4 equations, 7 figures, 6 tables)

This paper contains 27 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Sample results of vanilla SAM + CLIP and the proposed method. Given the class name (zero-shot) or example boxes (few-shot), our method can detect all objects in the image for counting.
  • Figure 2: Illustration of the proposed PseCo, following the steps: point, segment, and count. Given an input image, the point decoder predicts the class-agnostic heatmap to point out all objects. The image encoder and mask decoder from SAM are fixed during training (the prompt encoder is omitted here) and output the mask proposals. The proposals are classified with respect to CLIP image/text embeddings.
  • Figure 3: Sample results of generating the class-agnostic target heatmaps. Given (a) the input image and (b) uniform grid point prompts, SAM predicts (c) all segmentations. We combine (d) the contour centers of all segmentations, which avoids bad point prompts, and (e) the ground-truth point annotations to produce (f) the target heatmap. The resultant heatmap is used to supervise the point decoder.
  • Figure 4: Qualitative results for (a) few-shot and (b) zero-shot object counting and detection. The class names, ground-truth counts, and our predicted counts are shown in colored boxes. Zoom in for a better view.
  • Figure 5: Qualitative comparisons for (a) few-shot (the first 3 columns) and (b) zero-shot (the last 2 columns) object counting. Only the final points are shown in the second and third columns because the predicted boxes are crowded. Zoom in for a better view.
  • ...and 2 more figures
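
The heatmap construction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes each point (a contour center or a ground-truth annotation) is rendered as an isotropic Gaussian and that overlapping Gaussians are merged with a pixel-wise maximum. The function name `points_to_heatmap` and the `sigma` value are hypothetical.

```python
import numpy as np

def points_to_heatmap(points, height, width, sigma=2.0):
    """Render (x, y) points as Gaussian peaks on a (height, width) heatmap.

    Assumption: Gaussian kernel with fixed sigma, merged by pixel-wise max.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=float)
    for x, y in points:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        # The max keeps each peak at 1.0 even when neighbouring points overlap.
        heatmap = np.maximum(heatmap, g)
    return heatmap

# Combine the two point sources named in the caption (toy coordinates).
contour_centers = [(2, 3)]
gt_points = [(7, 1)]
target = points_to_heatmap(contour_centers + gt_points, height=10, width=10)
```

Using a maximum rather than a sum is one possible design choice; it keeps the peak value bounded in crowded regions, which matters when peaks are later read off as point prompts.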