Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao

Abstract

Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task and therefore lack fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present \textbf{QICA}, a novel framework that synergizes \underline{q}uantity percept\underline{i}on with robust spatial \underline{c}ost \underline{a}ggregation. Specifically, we introduce a Synergistic Prompting Strategy (\textbf{SPS}) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (\textbf{CAD}) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.
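
To make the idea of "operating directly on vision-text similarity maps" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes dense visual features of shape (H, W, D) and a single text embedding of shape (D,), computes a cosine-similarity map, and then smooths it with a fixed k x k mean filter. The function names `similarity_map` and `aggregate` are hypothetical, and the mean filter is only a stand-in for the learned spatial aggregation the abstract describes.

```python
import numpy as np


def similarity_map(visual_feats, text_embed, eps=1e-8):
    """Cosine similarity between every spatial feature and one text embedding.

    visual_feats: (H, W, D) dense visual features (assumed layout)
    text_embed:   (D,) text embedding for the target category
    returns:      (H, W) map with values in [-1, 1]
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    t = text_embed / (np.linalg.norm(text_embed) + eps)
    return v @ t


def aggregate(sim, k=3):
    """Hypothetical stand-in for learned aggregation: a k x k mean filter."""
    pad = k // 2
    padded = np.pad(sim, pad, mode="edge")  # replicate border values
    out = np.zeros_like(sim, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + sim.shape[0], dx:dx + sim.shape[1]]
    return out / (k * k)


# Toy usage with random features; one location is made to match the text.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 4, 8))
t = V[1, 2].copy()
S = similarity_map(V, t)   # raw similarity (cost) map, shape (4, 4)
A = aggregate(S)           # spatially smoothed map, same shape
```

Because the decoder in this sketch only ever sees the (H, W) similarity map rather than the D-dimensional features, whatever refinement is applied cannot reshape the encoder's feature space. This is one plausible reading of why the abstract ties the similarity-map design to preserved zero-shot transferability.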

Paper Structure (19 sections, 10 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: A comparison between (a) standard ZSOC and (b) QICA. (a) Existing methods typically rely on unimodal prompting and direct feature interaction; they lack fine-grained quantity awareness and spatial sensitivity, which often leads to overfitting. (b) QICA introduces numerically conditioned collaborative prompts and achieves accurate density estimation through cost aggregation decoding, supervised by $\mathcal{L}_{MQA}$.
  • Figure 2: Overall architecture of QICA. (a) SPS jointly adapts the frozen vision and language encoders by mapping quantity-aware text prompts to visual prompts via a coupling function ($\Phi$). (b) CAD first computes a similarity map between dense visual features ($\text{V}$) and category-only text embeddings ($\text{T}^{cat}$), then refines this map through spatial aggregation and multi-scale upsampling to predict the final density map. The entire framework is supervised by $\mathcal{L}_{MQA}$. Notably, the pathways involving explicit quantity information (Quantity Embed and $\text{T}^{full}$) are active only during training, while the model uses only category information at inference to ensure zero-shot generalization.
  • Figure 3: Visualization of the CAD pipeline. (a) Original image. (b) The similarity map derived from the fine-tuned CLIP exhibits fine-grained quantity awareness, effectively highlighting target instances while suppressing noise. (c) The aggregated cost map and (d) the final fused map demonstrate the recovery of fine-grained spatial structure, transforming coarse activations into precise density predictions.
  • Figure 4: Ablation on (a) prompt depth and (b) prompt length in QICA. We report results on the validation set of FSC-147.
  • Figure 5: Sensitivity analysis of loss weights on the FSC-147 validation set. The left panel shows the effect of $\lambda_1$ on $\mathcal{L}_{\text{enc}}^{\text{qty}}$, while the right panel shows the effect of $\lambda_2$ on $\mathcal{L}_{\text{dec}}^{\text{qty}}$. Gold stars (★) mark the optimal configurations.
  • ...and 2 more figures
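
The coupling function $\Phi$ mentioned in the Figure 2 caption maps text prompts to visual prompts. Its exact form is not given in the captions above, so the sketch below simply assumes a single linear projection from the text embedding width to the vision embedding width. The class name `CouplingFunction`, the widths 512 and 768, and the prompt length of 4 are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np


class CouplingFunction:
    """Hypothetical linear coupling: text prompt tokens -> visual prompt tokens."""

    def __init__(self, text_dim, vision_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init; in a trained model these would be learned parameters.
        self.weight = rng.normal(scale=0.02, size=(text_dim, vision_dim))
        self.bias = np.zeros(vision_dim)

    def __call__(self, text_prompts):
        # text_prompts: (num_prompts, text_dim) -> (num_prompts, vision_dim)
        return text_prompts @ self.weight + self.bias


# Toy usage: 4 prompt tokens (assumed length), text width 512, vision width 768.
text_prompts = np.ones((4, 512))
phi = CouplingFunction(text_dim=512, vision_dim=768)
visual_prompts = phi(text_prompts)  # shape (4, 768)
```

Under this assumption the visual prompts are a deterministic function of the text prompts, so the two encoders are adapted jointly rather than through independent prompt sets. The prompt length ablated in Figure 4 would correspond to the first dimension of `text_prompts` in this sketch.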