Decoupling What to Count and Where to See for Referring Expression Counting
Yuda Zou, Zijian Zhang, Yongchao Xu
TL;DR
This work tackles the misalignment between class-centric annotation points and attribute-defining visual regions in Referring Expression Counting (REC). It introduces W2-Net, a dual-query decoder that reasons jointly about what to count (w2c) and where to see (w2s), enabling explicit attribute-guided localization and improved subclass discrimination. A novel Subclass Separable Matching (SSM) strategy adds an exponential repulsive term that stabilizes training by reducing inter-subclass ambiguity. W2-Net achieves state-of-the-art results on REC-8K, reducing counting error by 22.5% (validation) and 18.0% (test) and improving localization F1 by 7% and 8%, respectively. The approach also generalizes well in zero-shot and cross-dataset settings (FSC-147 and CARPK), and the results highlight the importance of aligning supervisory signals with attribute-defining regions for practical counting systems.
Abstract
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state of the art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
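To make the idea of a repulsive force during label assignment concrete, the following is a minimal sketch and not the authors' SSM formulation, which this summary does not specify. It assumes a matching cost equal to the Euclidean distance between a predicted point and a target point, plus an exponential penalty for predictions that lie close to points of a different subclass. The function names and the parameters `alpha` (penalty weight) and `tau` (length scale) are hypothetical, and the brute-force search stands in for the Hungarian-style matcher a real system would likely use.

```python
import math
from itertools import permutations

def match_cost(pred, target, other_points, alpha=1.0, tau=1.0):
    """Assumed cost: distance to the target plus an exponential
    penalty for lying near points of a different subclass."""
    attract = math.dist(pred, target)
    repel = sum(math.exp(-math.dist(pred, q) / tau) for q in other_points)
    return attract + alpha * repel

def separable_match(preds, targets, other_points, alpha=1.0, tau=1.0):
    """One-to-one assignment by exhaustive search (tiny inputs only).
    Returns a list where entry j is the prediction index matched to target j."""
    best_cost, best = math.inf, None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(
            match_cost(preds[i], targets[j], other_points, alpha, tau)
            for j, i in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best)

# Toy data (hypothetical): one target of the referred subclass, and one
# annotation point of a different subclass sitting exactly on prediction 0.
preds = [(0.0, 0.0), (2.5, 0.0)]
targets = [(1.0, 0.0)]
other_subclass_points = [(0.0, 0.0)]

print(separable_match(preds, targets, other_subclass_points, alpha=0.0))  # [0]
print(separable_match(preds, targets, other_subclass_points, alpha=1.0))  # [1]
```

Without the penalty, the nearer prediction is matched even though it coincides with a point of another subclass. With the penalty enabled, the assignment shifts to the prediction that is farther from that point, which is the kind of inter-subclass separation the matching strategy is meant to encourage.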
