T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound

TL;DR

This work tackles zero-shot object counting by addressing text insensitivity in text-guided paradigms. It introduces T2ICount, a diffusion-model–based framework that uses single-step features for efficiency but incorporates a Hierarchical Semantic Correction Module and a Representational Regional Coherence Loss to restore strong text–image alignment. The authors also curate FSC-147-S, a harder evaluation subset that stresses counting of text-specified categories beyond the majority class, and demonstrate state-of-the-art performance on FSC-147 and FSC-147-S, with competitive results on CARPK. Overall, the approach provides a practical, cross-modal counting solution with robust text-conditioned supervision and a more rigorous evaluation protocol for text-guided counting.

Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC-147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visualizations of density maps predicted by official pretrained models of two recently proposed text-guided zero-shot object counting methods, CLIP-Count [jiang2023clip] and VLCounter [kang2024vlcounter], which demonstrate poor text sensitivity compared to the proposed T2ICount.
  • Figure 2: Overview of the proposed T2ICount. Our method is based on a single denoising step. An input image and text prompts specifying the category to be counted are fed into the pre-trained text-to-image diffusion model. Feature maps extracted from the decoder of the U-Net are passed through the Hierarchical Semantic Correction Module (HSCM) to enhance textual awareness, producing the final features used to estimate the density map. Text-image similarity maps are generated at intermediate stages and are supervised by the Representational Regional Coherence Loss. The ground-truth density map and the fused cross-attention maps ($\widehat{\mathcal{A}}^{cross}$) are used to generate the positive-negative-ambiguous (PNA) map, providing supervision signals for this loss. During training, the VAE encoder and the text encoder are frozen while the U-Net and HSCM are trained.
  • Figure 3: Visualization of the text sensitivity issue and of the key maps used to generate supervision signals for $\mathcal{L}_{\text{RRC}}$. (a-c) Cross-attention maps from different layers of pre-trained Stable Diffusion v1.5 [Rombach2022SD], demonstrating weak text-image sensitivity in single-step denoising; (d-f) key intermediate maps for constructing supervision signals: (d) fused cross-attention map, (e) derived pseudo-background map (white: foreground, black: background), and (f) positive-negative-ambiguous map (white: positive, black: negative, gray: ambiguous regions).
  • Figure 4: Qualitative comparison of T2ICount with VLCounter [kang2024vlcounter]. With our proposed $\mathcal{L}_{\text{RRC}}$, our text-image similarity map exhibits reduced noise and more precise object delineation, which results in a more accurate density estimation.
  • Figure 5: Qualitative results of T2ICount. Each pair shows the predicted density map (left) and the corresponding text-image similarity map (right), where the similarity maps effectively delineate the overall shapes of text-specified objects.
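
The captions for Figures 2 and 3 describe a positive-negative-ambiguous (PNA) map built from the ground-truth density map and the fused cross-attention map. The sketch below is one possible reading of that description, not the authors' implementation. The function name `build_pna_map`, the `bg_threshold` value, the label encoding, and the rule itself (annotated pixels are positive, weakly attended pixels are negative, everything else is ambiguous) are all assumptions made for illustration. It also shows the standard density-map convention that the count is the sum of the map.

```python
import numpy as np

# Hypothetical label encoding, chosen for this sketch only.
POSITIVE, NEGATIVE, AMBIGUOUS = 1, 0, -1


def build_pna_map(density, fused_attention, bg_threshold=0.3):
    """Label each pixel as positive, negative, or ambiguous.

    Assumed rule (not taken from the paper):
      - attention below `bg_threshold` after min-max scaling -> negative
      - non-zero ground-truth density -> positive (overrides negative)
      - everything else -> ambiguous
    """
    # Min-max scale the attention map so the threshold is scale-free.
    lo, hi = fused_attention.min(), fused_attention.max()
    attn = (fused_attention - lo) / (hi - lo + 1e-8)

    # Pseudo-background: pixels the text prompt barely attends to.
    background = attn < bg_threshold

    pna = np.full(density.shape, AMBIGUOUS, dtype=np.int8)
    pna[background] = NEGATIVE
    pna[density > 0] = POSITIVE
    return pna


# Toy 4x4 example: two annotated objects on the diagonal.
density = np.zeros((4, 4), dtype=np.float32)
density[1, 1] = 1.0
density[2, 2] = 1.0

attention = np.array(
    [
        [0.0, 0.1, 0.1, 0.0],
        [0.1, 0.9, 0.6, 0.1],
        [0.1, 0.6, 0.9, 0.1],
        [0.0, 0.1, 0.1, 0.0],
    ],
    dtype=np.float32,
)

pna = build_pna_map(density, attention)

# In density-based counting, the object count is the sum of the density map.
count = float(density.sum())
```

In this toy case the two annotated pixels are labelled positive, the two off-diagonal pixels with moderate attention stay ambiguous, and the remaining twelve pixels become negative. A loss such as $\mathcal{L}_{\text{RRC}}$ could then be restricted to the positive and negative regions, although how the paper actually uses the map is not specified in this section.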