Exploring Contextual Attribute Density in Referring Expression Counting

Zhicheng Wang, Zhiyu Pan, Zhan Peng, Jian Cheng, Liwen Xiao, Wei Jiang, Zhiguo Cao

TL;DR

This work defines contextual attribute density (CAD) as the information intensity of fine-grained attributes in visual regions and integrates it into a multi-modal counting framework. The proposed CAD-GD combines a U-shaped CAD estimator (CADE) with a CAD Attention Module and a CAD Dynamic Query Module, which inject CAD into the visual features and the dynamic counting queries, respectively, under the supervision of a density loss. Empirical results on REC-8K, FSC-147, and CARPK show robust gains in counting accuracy and localization, including strong zero-shot performance. The approach highlights the importance of density-aware reasoning for fine-grained cross-modal understanding and offers a scalable pathway to improving open-world counting.

Abstract

Referring expression counting (REC) algorithms aim to provide more flexible and interactive counting across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior work, which struggles to align attribute information accurately with the correct visual patterns. Given the proven importance of "visual density", we presume that the limitations of current REC approaches stem from an under-exploration of "contextual attribute density" (CAD). In the scope of REC, we define CAD as a measure of the information intensity of a given fine-grained attribute in visual regions. To model CAD, we propose a U-shaped CAD estimator in which the referring expression and multi-scale visual features from GroundingDINO interact with each other. With additional density supervision, we can effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieving a $30\%$ error reduction in counting metrics and a $10\%$ improvement in localization accuracy. These surprising results shed light on the significance of contextual attribute density for REC. Code will be available at github.com/Xu3XiWang/CAD-GD.
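
The abstract mentions density supervision without stating how the target map or the loss is built. The sketch below illustrates one common recipe from the counting literature, not the authors' implementation: each annotated point of an object matching the referring expression becomes a normalized Gaussian, the map sums to the object count, and a pixel-wise mean squared error compares prediction and target. The function names, the Gaussian kernel, `sigma`, and the MSE loss are all assumptions made for illustration.

```python
import math


def gaussian_density_map(points, height, width, sigma=1.5):
    """Build a density map with one normalized Gaussian per (x, y) point.

    Assumption: `points` holds only the objects that match the referring
    expression, so the map sums to the expression-specific count.
    """
    density = [[0.0] * width for _ in range(height)]
    for px, py in points:
        kernel = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))
        # Normalize over the grid so each object contributes exactly 1.
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def count_from_density(density):
    """The predicted count is the sum over all pixels of the map."""
    return sum(map(sum, density))


def mse_density_loss(pred, target):
    """Pixel-wise mean squared error (an assumed form of the density loss)."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


if __name__ == "__main__":
    # Hypothetical annotations for three objects matching one expression.
    gt = gaussian_density_map([(3, 4), (10, 6), (12, 12)], height=16, width=16)
    empty = [[0.0] * 16 for _ in range(16)]
    print(round(count_from_density(gt), 6))
    print(mse_density_loss(empty, gt) > 0.0)
```

Normalizing each kernel over the image grid keeps the count exact even for points near the border, at the cost of slightly distorting the Gaussian shape there.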

Paper Structure

This paper contains 27 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The efficacy of contextual attribute density (CAD). In this qualitative comparison, the REC baseline (dai2024referring) often over-counts when it focuses primarily on class information rather than fine-grained attributes, and it under-counts in cases of occlusion or scale variation. By modeling CAD, our method aligns fine-grained attributes with the corresponding visual regions more accurately, effectively reducing counting errors.
  • Figure 2: The framework of Contextual Attribute Density Aware GroundingDINO (CAD-GD). A query image and a referring expression are passed through separate backbones and then through the feature enhancer to obtain the visual features and the text feature. The CAD Generate Module, supervised by the ground-truth (GT) contextual attribute density map, then produces the Contextual Attribute Density (CAD) features. To leverage the CAD information, we design the CAD Attention Module and the CAD Dynamic Query Module to enhance the visual features and the query contents, respectively. Finally, we send the dynamic queries, the CAD-enhanced visual features, and the text feature into the localization decoder to obtain the final localization prediction.
  • Figure 3: Visualization of queries. We visualize the initial queries $Q$, the text-dynamic-initialized queries $\dot Q$, and the CAD-dynamic-initialized queries $\hat{Q}$ in a polar coordinate system using t-SNE (van2008visualizing). (b) shows the feature distribution of the initial queries, which is the same for different referring expressions. (c) shows the feature distribution of the text-initialized queries: the distributions for "bluish pen" and "greenish pen" are quite similar and overlap heavily. (d) shows the feature distribution of the text- and density-initialized queries: the distributions differ markedly, so the two expressions can be distinguished easily.
  • Figure 4: Qualitative results on the REC-8K dataset. Columns 1-3 contain examples whose referring expressions (REs) have direct attributes, and columns 4-6 contain those with context-related attributes. Our method consistently outperforms GroundingREC, with more precise locations.
  • Figure 5: Visualization of the contextual attribute density map. The predicted density map captures the locations of the objects described by the referring expressions.
  • ...and 3 more figures
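
The Figure 3 caption describes query embeddings plotted in a polar coordinate system after t-SNE. As a rough illustration of that last step only, the sketch below converts 2-D embedding points (assumed to have been produced already, for example by a t-SNE implementation) into radius and angle pairs. Centering on the centroid and the helper name `to_polar` are assumptions; the caption does not say how the authors performed the conversion.

```python
import math


def to_polar(points_2d):
    """Convert 2-D points to (radius, angle) pairs around their centroid.

    Assumption: the input is an already-computed 2-D embedding of the
    queries; centering on the centroid is an illustrative choice.
    """
    if not points_2d:
        return []
    cx = sum(x for x, _ in points_2d) / len(points_2d)
    cy = sum(y for _, y in points_2d) / len(points_2d)
    return [
        (math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
        for x, y in points_2d
    ]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings for four queries.
    embedded = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
    for radius, angle in to_polar(embedded):
        print(f"r={radius:.2f}, theta={angle:.2f} rad")
```

Plotting the resulting angles for two referring expressions on the same polar axes would give a view comparable in spirit to panels (c) and (d), where overlapping angular ranges indicate queries that are hard to tell apart.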