Decoupling What to Count and Where to See for Referring Expression Counting

Yuda Zou, Zijian Zhang, Yongchao Xu

TL;DR

This work tackles the misalignment between class-centric annotation points and attribute-defining visual regions in Referring Expression Counting (REC). It introduces W2-Net, a dual-query decoder that concurrently reasons about what to count (w2c) and where to see (w2s), enabling explicit attribute-guided localization and improved subclass discrimination. A novel Subclass Separable Matching (SSM) with an exponential repulsive term stabilizes training by reducing inter-subclass ambiguity, yielding state-of-the-art results on REC-8K with substantial reductions in counting error and gains in localization accuracy. The approach demonstrates strong generalization in zero-shot and cross-dataset settings (FSC-147 and CARPK) and highlights the importance of aligning supervisory signals with attribute-defining regions for practical counting systems.

Abstract

Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state of the art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
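
The abstract describes SSM only at a high level: a repulsive force added to the matching cost during label assignment. To make the idea concrete, here is a minimal sketch of how such a term could change a one-to-one point matching. It is not the authors' formulation. The cost form (Euclidean distance plus an exponential penalty on proximity to other-subclass points), the parameters `lam` and `tau`, and all function names are assumptions made for illustration, and a brute-force search stands in for a Hungarian solver to keep the example dependency-free.

```python
import itertools
import math


def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def repulsion(pred, other_points, lam, tau):
    """Assumed penalty: grows as pred nears a ground-truth point of another subclass."""
    if not other_points or lam == 0.0:
        return 0.0
    nearest = min(dist(pred, o) for o in other_points)
    return lam * math.exp(-nearest / tau)


def match(preds, targets, others, lam=1.0, tau=1.0):
    """Assign one distinct prediction to each target point at minimum total cost.

    preds:   predicted points
    targets: ground-truth points of the referred subclass
    others:  ground-truth points of other subclasses in the same image
    Returns a list of (pred_index, target_index) pairs.
    """
    if len(preds) < len(targets):
        raise ValueError("need at least as many predictions as targets")
    cost = [
        [dist(p, t) + repulsion(p, others, lam, tau) for t in targets]
        for p in preds
    ]
    best, best_cost = (), math.inf
    # Brute force over assignments; only suitable for tiny illustrative inputs.
    for perm in itertools.permutations(range(len(preds)), len(targets)):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return [(i, j) for j, i in enumerate(best)]


# Toy case: prediction 0 is slightly closer to the target, but it also sits
# next to a point that belongs to a different subclass.
preds = [(0.9, 0.0), (-1.0, 0.0)]
targets = [(0.0, 0.0)]
others = [(2.0, 0.0)]
plain = match(preds, targets, others, lam=0.0)      # distance-only matching
separated = match(preds, targets, others, lam=1.0)  # with the assumed repulsive term
```

In this toy case, distance-only matching selects prediction 0, while the repulsive term shifts the assignment to prediction 1, which lies farther from the other-subclass point. This is the kind of inter-subclass separation the abstract attributes to SSM, under the assumptions stated above.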

Paper Structure

This paper contains 35 sections, 13 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the core challenge in REC that we address. REC point annotations (green pentastar), placed on class-representative locations such as heads, provide insufficient guidance for attribute-specific regions (e.g., legs for "standing" or "walking"). This hinders the model from distinguishing fine-grained subclasses ("person standing" and "person walking"). Our W2-Net introduces dedicated where-to-see (w2s) queries (yellow pentastar) that actively seek attribute-relevant visual cues. By fusing these features with those of the corresponding standard what-to-count (w2c) queries, the model achieves precise subclass discrimination. The attention points visualize the attention focus of each query type. Best viewed zoomed in, in the electronic version.
  • Figure 2: The framework of W2-Net. W2-Net decouples "what to count" and "where to see" in the proposed W2 Decoder, where the what-to-count (w2c) query locates the object's class-representative center and the parallel, dedicated where-to-see (w2s) query grounds the distinguishing attribute. Fusing their features enables precise subclass discrimination. In addition, we develop Subclass Separable Matching (SSM) to stabilize training by introducing a repulsive force into the matching cost, effectively resolving inter-subclass ambiguity and ensuring stable supervision.
  • Figure 3: Effectiveness of Subclass Separable Matching (SSM) in resolving training ambiguity. A standard matching approach (blue) consistently suffers from high ambiguity due to inter-subclass similarity. In contrast, our SSM (orange), which incorporates a repulsive force, alleviates such ambiguity from the beginning of training, ensuring a stable and accurate supervision signal and improved performance.
  • Figure 4: Qualitative results of CAD-GD [wang2025exploring_CAD-GD] and our W2-Net on the REC-8K dataset.
  • Figure S-5: Qualitative visualization of the W2-Decoder's mechanism on the REC-8K dataset. Green, blue, and yellow pentastars denote the ground-truth (GT) point, the final predicted point from the what-to-count (w2c) query, and the where-to-see (w2s) query, respectively. Attention points visualize the focus area of each query. (Left Column): The trajectory of our w2s query across six decoder layers, demonstrating its progressive convergence towards attribute-relevant regions. (Middle Column): Synergy in our W2-Net. The w2c query's attention correctly centers on class-representative areas (e.g., a person's head), while the w2s query's attention seeks out attribute-specific visual cues (e.g., the bicycle for "person riding a bicycle"). This fusion of attribute-aware features enables precise subclass discrimination. (Right Column): The baseline model. Its w2c query, supervised solely by the GT point, focuses narrowly on class features and neglects crucial attribute information, hindering its ability to distinguish between similar subclasses. Notably, for the spatial attribute "bird in the second layer" (bottom row), our w2s query expands its search area to gather global context, enabling precise counting and localization based on relative position.