CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

Yiming Ma, Victor Sanchez, Tanaya Guha

TL;DR

This work addresses the challenge of using CLIP for crowd counting by reframing counting as a blockwise classification task within the Enhanced Blockwise Classification (EBC) framework. EBC replaces real-valued count bins with integer-valued bins, introduces noise reduction, and couples classification with a count-aware loss $\mathcal{L}_\text{DACE}$ to improve regression metrics. Building on EBC, CLIP-EBC becomes the first fully CLIP-based crowd counting model that generates density maps by aligning local image features with text prompts for each bin. Across multiple benchmarks, EBC yields substantial improvements over prior classification-based methods, while CLIP-EBC achieves state-of-the-art results on the NWPU-Crowd test set and competitive performance elsewhere, demonstrating that CLIP can effectively support dense density estimation when guided by bin-based, loss-aware design.
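
To make the integer-valued binning concrete, the sketch below maps per-block counts to bin indices. It assumes per-block counts are non-negative integers. The bin layout `BINS` and the helper names `count_to_bin` and `label_blocks` are hypothetical, chosen for illustration rather than taken from the paper or its released code.

```python
import math

# Hypothetical bin layout for illustration: singleton bins {0}, {1}, {2}, {3}
# and one open-ended tail bin [4, inf). The paper's actual bins may differ.
BINS = [(0, 0), (1, 1), (2, 2), (3, 3), (4, math.inf)]


def count_to_bin(count):
    """Return the index of the bin whose inclusive integer range holds `count`."""
    if count < 0:
        raise ValueError("block counts must be non-negative")
    for index, (low, high) in enumerate(BINS):
        if low <= count <= high:
            return index
    # Not reachable while the last bin is open-ended.
    raise ValueError(f"no bin covers count {count}")


def label_blocks(block_counts):
    """Map a 2D grid of per-block integer counts to a grid of class labels."""
    return [[count_to_bin(c) for c in row] for row in block_counts]


if __name__ == "__main__":
    grid = [[0, 1, 3], [2, 7, 0]]
    print(label_blocks(grid))  # prints [[0, 1, 3], [2, 4, 0]]
```

Because every integer count falls inside exactly one bin, no count can sit on a border shared by two bins.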

Abstract

We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
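
The following sketch illustrates why a classification loss alone can leave count errors unpenalized, and how adding a count term changes that. It is a simplified stand-in, not the paper's loss: it adds a plain absolute error on the total count to a per-block cross-entropy, and the function names, the `weight` parameter, and the representative `bin_values` are all assumptions made for this example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true bin for one block."""
    return -math.log(max(probs[label], 1e-12))


def combined_loss(prob_map, labels, bin_values, gt_count, weight=1.0):
    """Classification term plus a weighted count term (illustrative stand-in).

    prob_map:   list of per-block probability vectors over the bins
    labels:     list of true bin indices, one per block
    bin_values: representative count for each bin (assumed, e.g. 0, 1, 2, ...)
    gt_count:   ground-truth total count for the image
    weight:     hypothetical trade-off factor between the two terms
    """
    class_term = sum(cross_entropy(p, y) for p, y in zip(prob_map, labels))
    class_term /= len(labels)
    # Expected count per block, summed over all blocks, is the predicted total.
    pred_count = sum(
        sum(p * v for p, v in zip(probs, bin_values)) for probs in prob_map
    )
    count_term = abs(pred_count - gt_count)
    return class_term + weight * count_term


if __name__ == "__main__":
    bin_values = [0, 1, 2, 3, 4]
    near = [[0.0, 0.5, 0.5, 0.0, 0.0]]  # half the mass on a neighbouring bin
    far = [[0.0, 0.5, 0.0, 0.0, 0.5]]   # half the mass on a distant bin
    # Both predictions put 0.5 on the true bin, so their cross-entropy is equal.
    print(combined_loss(near, [1], bin_values, gt_count=1))
    print(combined_loss(far, [1], bin_values, gt_count=1))
```

In this toy case the two predictions are indistinguishable to cross-entropy, yet the second one overestimates the count by a larger margin, and only the count term separates them.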

Paper Structure (13 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison of our model CLIP-EBC with other CLIP-based crowd counting methods. Top: CrowdCLIP (Liang et al., 2023) predicts total counts from global image features. Because it cannot capture spatial details, its accuracy is reduced. Middle: CLIP-Count (Jiang et al., 2023) employs extra modules to generate density maps, improving accuracy at the cost of increased complexity. Bottom: CLIP-EBC is the first fully CLIP-based density-map approach, achieving high accuracy without additional modules.
  • Figure 2: Overview of the EBC framework and the CLIP-EBC model. The EBC framework introduces an integer-based quantization strategy and noise reduction to partition count values into $n$ predefined bins (e.g., $\{0\}, \{1\}, \cdots, [m, \infty)$). Given an input image, the crowd counting model implemented under EBC generates a blockwise probability map. The density map is obtained by averaging the bin midpoints weighted by the probability map. Finally, the predicted probability map and the density map are fed into the proposed $\mathcal{L}_\text{DACE}$ loss, which is used to fine-tune the model. CLIP-EBC, an instance of EBC, leverages the predefined bins and the prompt template to construct $n$ text prompts (e.g., "There is 0 person"). The frozen CLIP (Radford et al., 2021) text encoder extracts text embeddings, while the fine-tuned CLIP image encoder generates image feature maps. For the image feature vector at each spatial location, we calculate its cosine similarity with the text embedding for each bin and then apply softmax to obtain the probability map. Subsequent steps align with the EBC framework to generate the final outputs.
  • Figure 3: An example from the ShanghaiTech A dataset (Zhang et al., 2016) illustrating an extremely dense area with erroneous annotations. The $8 \times 8$ magenta box highlights a congested region with a labeled count of 9 individuals. However, the zoomed-in view reveals that no discernible human figures can be identified within the marked box, evidencing label noise in highly dense regions.
  • Figure 4: Visualization of density maps predicted by CLIP-EBC.
  • Figure 5: Influence of the number of learnable tokens in our CLIP-EBC model with the ViT-B/16 backbone. Experiments are conducted on the ShanghaiTech A dataset. The results show that setting the number of learnable tokens to 32 yields optimal performance, achieving the lowest values for both MAE and RMSE.
  • ...and 1 more figure
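
The pipeline described in the Figure 2 caption (cosine similarity with each bin's text embedding, softmax, then a probability-weighted average of bin values) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the vectors are toy placeholders rather than real CLIP embeddings, the `temperature` parameter and all function names are hypothetical, and the representative value used for the open-ended last bin is an assumption, since that bin has no midpoint.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_density(image_feature, text_embeddings, bin_values, temperature=1.0):
    """Return (bin probabilities, expected count) for one spatial location.

    temperature is a hypothetical scaling factor applied before the softmax.
    """
    sims = [cosine(image_feature, t) for t in text_embeddings]
    probs = softmax([s / temperature for s in sims])
    density = sum(p * v for p, v in zip(probs, bin_values))
    return probs, density


def density_map(feature_map, text_embeddings, bin_values, temperature=1.0):
    """Apply block_density at every location of a 2D grid of feature vectors."""
    return [
        [block_density(f, text_embeddings, bin_values, temperature)[1] for f in row]
        for row in feature_map
    ]


if __name__ == "__main__":
    # Toy stand-ins for the text embeddings of two bins with values 0 and 1.
    text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
    bin_values = [0.0, 1.0]
    feature_map = [[[1.0, 0.0], [0.0, 2.0]]]
    dmap = density_map(feature_map, text_embeddings, bin_values, temperature=0.01)
    print(dmap, sum(sum(row) for row in dmap))
```

Summing the resulting map gives the predicted count for the image. With a small temperature, each location's probability mass concentrates on the bin whose embedding is most similar to the local image feature.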