OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

TL;DR

OmniCount is an open-vocabulary framework that counts multiple object categories simultaneously in a single pass. It generates precise object masks and uses varied interactive prompts via the Segment Anything Model (SAM) for efficient counting.

Abstract

Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies have their own limitations, such as the need for manual exemplar input and a separate pass for each category, which makes them inefficient. This paper introduces a more practical approach that counts multiple object categories simultaneously using an open-vocabulary framework. Our solution, OmniCount, uses semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation on OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions. The project webpage is available at https://mondalanindya.github.io/OmniCount.

Paper Structure (35 sections, 4 equations, 24 figures, 12 tables)

Figures (24)

  • Figure 1: Object counting paradigms: (a) Typical single-label object counting models support open-vocabulary counting but process only a single category at a time. (b) Existing multi-label object counting models are training-based (i.e., not open-vocabulary) and also fail to count non-atomic objects (e.g., grapes). (c) We advocate more efficient and convenient multi-label open-vocabulary counting that is training-free and counts all target categories in a single pass.
  • Figure 2: OmniCount pipeline: OmniCount processes the input image and target object classes using Semantic Estimation (SAN) and Geometric Estimation (Marigold) modules to generate class-specific masks and depth maps. These initial semantic and geometric priors are then refined through an Object Recovery module, producing precise binary masks. The refined masks help extract RGB patches and reference points, reducing over-counting. SAM then uses these RGB patches and reference points to generate instance-level masks, resulting in accurate object counts. (denotes pre-trained, frozen models)
  • Figure 3: Geometry-aware Object Recovery: We refine semantic masks with geometric priors, using k-nearest neighbor searches to filter edge pixels by category uniqueness and depth alignment. This depth-integrated segmentation improves mask precision.
  • Figure 4: Reference Point Selection: SAM’s segmentation accuracy is enhanced by refining reference point selection. Panel (A) shows how integrating semantic priors, identifying local maxima, and applying Gaussian refinement improve reference point accuracy, focusing them on foreground objects for better segmentation and counting. Panel (B) demonstrates the benefits of incorporating semantic and geometric priors, where depth-based recovery and precise reference points help SAM recover distant or occluded objects, reducing over-segmentation issues found in the default "everything mode".
  • Figure 5: OmniCount-191 Annotations: A collection of images with $191$ classes across nine domains, annotating each image with captions, VQA, boxes, and points.
  • ...and 19 more figures
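
The reference point selection described in Figure 4 (A) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the semantic prior is a 2D score map for one class, that Gaussian refinement means smoothing that map before peak detection, and that a reference point is a strict local maximum lying inside the refined binary mask. The function names, the 3x3 neighbourhood, and the `sigma`, `radius`, and `threshold` defaults are all hypothetical choices made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(grid[0])
    out = []
    for row in grid:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def smooth(score, sigma=1.0, radius=2):
    """Separable Gaussian blur: along rows first, then along columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(score, kernel)), kernel))


def reference_points(score, mask, sigma=1.0, radius=2, threshold=0.1):
    """Return (row, col) peaks of the smoothed score map inside the mask.

    `score` is an assumed per-class semantic prior and `mask` an assumed
    refined binary foreground mask; both are lists of equal-sized rows.
    """
    s = smooth(score, sigma, radius)
    h, w = len(s), len(s[0])
    points = []
    for y in range(h):
        for x in range(w):
            # Skip background pixels and weak responses.
            if not mask[y][x] or s[y][x] < threshold:
                continue
            neighbours = [s[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (j, i) != (y, x)]
            # Keep only strict local maxima of the 3x3 neighbourhood.
            if all(s[y][x] > n for n in neighbours):
                points.append((y, x))
    return points
```

Under these assumptions, each returned point would serve as a point prompt for SAM, and the number of resulting instance masks for a class would give that class's count. Restricting peaks to the foreground mask is what keeps prompts off the background, which is the over-segmentation problem the caption attributes to SAM's default "everything mode".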