Enhancing Zero-shot Counting via Language-guided Exemplar Learning

Mingjie Wang, Jun Zhou, Yong Dai, Eric Buys, Minglun Gong

TL;DR

This work tackles zero-shot counting across arbitrary object categories by integrating language priors into exemplar learning. It introduces ExpressCount, a two-branch framework where a language-oriented Exemplar Perceptron leverages linguistic and visual signals via a multimodal transformer, while a dual-branch counting pathway uses cross-attention to learn fine-grained similarity for unseen classes. A counting-focused linguistic dataset, FSC-147-Express, supplies fine-grained expressions to train and evaluate language-guided counting. Experiments show state-of-the-art performance on FSC-147 benchmarks, with substantial gains over exemplar-free and GEL approaches, and ablations confirm the benefits of detailed expressions, multiple exemplars, and hybrid supervision. The approach broadens zero-shot counting applicability, reduces annotation burdens, and establishes a new direction for language-informed counting models with practical impact on diverse counting tasks.

Abstract

Recently, the Class-Agnostic Counting (CAC) problem has garnered increasing attention owing to its generality and superior efficiency compared to Category-Specific Counting (CSC). This paper proposes ExpressCount, a novel model that enhances zero-shot object counting through language-guided exemplar learning. Specifically, ExpressCount comprises a Language-oriented Exemplar Perceptron and a downstream visual Zero-shot Counting pipeline. The perceptron aims to extract accurate exemplar cues from collaborative language-vision signals by inheriting rich semantic priors from prevailing pre-trained Large Language Models (LLMs), whereas the counting pipeline mines fine-grained features through dual-branch and cross-attention schemes, contributing to high-quality similarity learning. Apart from building a bridge between current LLMs and visual counting tasks, expression-guided exemplar estimation significantly advances zero-shot learning capabilities for counting instances of arbitrary classes. Moreover, the newly devised FSC-147-Express dataset, annotated with meticulous linguistic expressions, opens a new avenue for developing and validating language-based counting models. Extensive experiments demonstrate the state-of-the-art performance of ExpressCount, with accuracy even on par with some CSC models.

Paper Structure (26 sections, 2 equations, 7 figures, 4 tables)

This paper contains 26 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) Conventional CAC models require user-provided exemplars, imposing a substantial manual burden for specifying dense object locations and limiting the models' applicability; (b) Rudimentary exemplar-free CAC attempts to learn exemplar cues in a traversal manner, often yielding unsatisfactory results owing to semantic ambiguity; (c) Language-guided Exemplar Learning enriches linguistic semantics to steer accurate exemplar and count regression.
  • Figure 2: The overall architecture of our ExpressCount, which introduces an effective language-oriented exemplar perceptron into visual counting tasks. Specifically, the exemplar perceptron takes a natural image and one detailed expression as input, characterizing and blending both textual and visual signals to guide the accurate learning of exemplars. Moreover, a dual-branch network with cross-attention recalibration is proposed to automatically perform similarity learning, finally inferring the counts of object instances with unseen classes.
  • Figure 3: An example of how language expressions are annotated in the language-vision counting dataset (FSC-147-Express).
  • Figure 4: Visual comparison among three versions of our language-based methods. The first row depicts the exemplar predictions, the second row shows the captured exemplar patches in zoomed-in views, and the third row shows the three corresponding similarity maps. It can be observed that guidance from language- and image-oriented semantic priors both contribute to enhancing the accuracy of exemplar prediction.
  • Figure 5: Visual comparisons for ablation studies. Top: exemplar extraction under coarse-to-fine expressions shows the benefit of detailed language expressions. Bottom: predicting three exemplars instead of one may have either positive (diversified samples) or negative (added noise) effects.
  • ...and 2 more figures
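
The Figure 2 caption mentions cross-attention recalibration between the image branch and the exemplar branch. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in plain Python, where each image location (query) attends over exemplar tokens (keys and values). The function names, the toy feature vectors, and the choice of queries and keys are assumptions made for this example; the paper's actual layers, dimensions, and learned projections are not described in this section.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (no learned projections).

    queries: one feature vector per image location (hypothetical layout)
    keys, values: one feature vector per exemplar token (hypothetical layout)
    Returns the attended features and the attention weights per query.
    """
    d = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this image location to every exemplar token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of exemplar values, one output channel at a time.
        attended.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return attended, weights


# Toy data (made up): three image locations and two exemplar tokens, 2-D features.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
exemplar_feats = [[1.0, 0.0], [0.0, 1.0]]

recalibrated, attn = cross_attention(image_feats, exemplar_feats, exemplar_feats)
```

In this toy setup, each row of `attn` sums to one, and an image location places more weight on the exemplar token it resembles most. A real model would add learned query, key, and value projections and operate on feature maps rather than hand-written vectors.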