Enhancing Zero-shot Counting via Language-guided Exemplar Learning
Mingjie Wang, Jun Zhou, Yong Dai, Eric Buys, Minglun Gong
TL;DR
This work tackles zero-shot counting across arbitrary object categories by integrating language priors into exemplar learning. It introduces ExpressCount, a framework with two components: a Language-oriented Exemplar Perceptron, which combines linguistic and visual signals through a multimodal transformer, and a dual-branch counting pathway, which uses cross-attention to learn fine-grained similarity for unseen classes. A counting-focused linguistic dataset, FSC-147-Express, supplies fine-grained expressions for training and evaluating language-guided counting. Experiments show state-of-the-art performance on FSC-147 benchmarks, with substantial gains over exemplar-free and GEL approaches, and ablations confirm the benefits of detailed expressions, multiple exemplars, and hybrid supervision. The approach broadens the applicability of zero-shot counting, reduces annotation burdens, and establishes a new direction for language-informed counting models with practical impact on diverse counting tasks.
Abstract
Recently, the Class-Agnostic Counting (CAC) problem has garnered increasing attention owing to its generality and superior efficiency compared to Category-Specific Counting (CSC). This paper proposes ExpressCount, a novel framework that enhances zero-shot object counting through language-guided exemplar learning. Specifically, ExpressCount comprises a Language-oriented Exemplar Perceptron and a downstream visual Zero-shot Counting pipeline. The perceptron extracts accurate exemplar cues from collaborative language-vision signals by inheriting rich semantic priors from prevailing pre-trained Large Language Models (LLMs), whereas the counting pipeline mines fine-grained features through dual-branch and cross-attention schemes, contributing to high-quality similarity learning. Beyond building a bridge between popular LLMs and visual counting tasks, expression-guided exemplar estimation significantly advances zero-shot learning capabilities for counting instances of arbitrary classes. Moreover, we construct FSC-147-Express, a dataset annotated with meticulous linguistic expressions, which opens a new avenue for developing and validating language-based counting models. Extensive experiments demonstrate the state-of-the-art performance of ExpressCount, with accuracy even on par with some CSC models.
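To make the cross-attention similarity idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes image features and exemplar features are already available as token matrices of the same width, and it omits the learned query/key/value projections that a real model would use. The function names `softmax` and `cross_attention`, the token counts, and the feature width are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, exemplar_tokens):
    """Scaled dot-product cross-attention (illustrative, no learned weights).

    image_tokens:    (N, d) array, one row per image location (queries).
    exemplar_tokens: (M, d) array, one row per exemplar token (keys/values).
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    d = image_tokens.shape[-1]
    # Similarity of every image location to every exemplar token.
    scores = image_tokens @ exemplar_tokens.T / np.sqrt(d)
    # Normalize over exemplar tokens so each row sums to 1.
    weights = softmax(scores, axis=-1)
    # Each image location receives a weighted mix of exemplar features.
    attended = weights @ exemplar_tokens
    return attended, weights


# Hypothetical sizes: 16 image locations, 3 exemplar tokens, width 8.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))
exemplar_tokens = rng.normal(size=(3, 8))
attended, weights = cross_attention(image_tokens, exemplar_tokens)
```

In a counting model of this kind, the attention weights or attended features would typically feed a regression head that predicts a density map; that step is left out here because the abstract does not specify it.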
