Mask Focal Loss: A unifying framework for dense crowd counting with canonical object detection networks

Xiaopin Zhong; Guankun Wang; Weixiang Liu; Zongze Wu; Yuanlong Deng

TL;DR

This work tackles crowd counting by reframing detection losses to handle dense, cluttered scenes and center-point annotations through Mask Focal Loss (MFL). By leveraging Gaussian heatmaps and an area-based positive mask, MFL unifies loss calculations for heatmap and binary feature map detectors, addressing sample imbalance and spatial coherence issues. The authors introduce GTA_Head, a large synthetic dataset with bounding-box annotations to benchmark detection-based counting, and demonstrate that MFL consistently improves MAE and RMSE across detectors and datasets, with anchor-free models showing the strongest gains. These findings suggest that density estimation-based approaches can be outperformed by well-calibrated detection pipelines when trained with MFL, offering a practical foundation for real-time, scalable crowd counting in complex scenes.
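The two ground-truth maps mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `sigma` value, and the pixel-inclusive `(x1, y1, x2, y2)` box convention are assumptions made for illustration. The mask follows the Figure 3 caption (1 inside a target bounding box, 0 in the background). How the two maps are combined into the loss is defined in the paper and is not reproduced here.

```python
import numpy as np

def gaussian_heatmap(shape, centers, sigma=2.0):
    """Place one Gaussian peak per annotated head center (x, y).

    sigma is a hypothetical fixed spread; the paper may size it per object.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros(shape, dtype=np.float64)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # overlapping peaks keep the larger value
    return heat

def box_mask(shape, boxes):
    """Binary mask: 1 inside each (x1, y1, x2, y2) box, 0 in the background.

    Box corners are treated as inclusive pixel indices (an assumption).
    """
    mask = np.zeros(shape, dtype=np.float64)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2 + 1, x1:x2 + 1] = 1.0
    return mask

# Toy 16x16 scene with two heads.
heat = gaussian_heatmap((16, 16), [(4, 4), (10, 9)])
mask = box_mask((16, 16), [(2, 2, 6, 6), (8, 7, 12, 11)])
```

In this toy scene each heatmap peak equals 1 at its annotated center and decays with distance, while the mask marks the two 5x5 box regions as positive area.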

Abstract

As a fundamental computer vision task, crowd counting plays an important role in public safety. Currently, deep-learning-based head detection is a promising method for crowd counting. However, widely studied object detection networks do not transfer well to this problem for three reasons: (1) existing loss functions fail to address sample imbalance in highly dense and complex scenes; (2) canonical object detectors lack spatial coherence in loss calculation, disregarding the relationship between object location and background region; (3) most head detection datasets are annotated only with center points, i.e., without bounding boxes. To overcome these issues, we propose a novel Mask Focal Loss (MFL) based on a heatmap generated with a Gaussian kernel. MFL provides a unifying framework for loss functions based on both heatmap and binary feature map ground truths. Additionally, we introduce GTA_Head, a synthetic dataset with comprehensive annotations, for evaluation and comparison. Extensive experimental results demonstrate the superior performance of MFL across various detectors and datasets: it reduces MAE and RMSE by up to 47.03% and 61.99%, respectively. Therefore, our work presents a strong foundation for advancing crowd counting methods based on density estimation.
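For readers unfamiliar with the metrics quoted above, the following sketch computes MAE and RMSE over per-image counts, plus a relative reduction in percent. It assumes the standard definitions of both metrics; the helper names and the toy counts are hypothetical and are not results from the paper.

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean squared error between predicted and true per-image counts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def relative_reduction(baseline, improved):
    """Percentage by which an error metric drops relative to a baseline."""
    return 100.0 * (baseline - improved) / baseline

# Toy counts for three images (illustrative only).
pred_counts = [98, 205, 51]
true_counts = [100, 200, 50]
toy_mae = mae(pred_counts, true_counts)
toy_rmse = rmse(pred_counts, true_counts)
```

Because RMSE squares each error before averaging, it penalizes occasional large miscounts more heavily than MAE does, which is why the two are usually reported together.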

Paper Structure (23 sections, 10 equations, 8 figures, 7 tables)

Introduction
Related works
Regression based methods
Density estimation-based methods
Detection based methods
Mask Focal Loss
Focal Loss and its variants
Mask Focal Loss
Experiments and results
GTA_Head dataset
Evaluation metrics
Ablation study
Configuration
Results and discussion
Evaluation and comparison on different networks
...and 8 more sections

Figures (8)

Figure 1: An example of a highly dense and complex crowd counting scene. (left) original image, (middle) image with head center labels, (right) image with head bounding box labels.
Figure 2: Timeline of milestone methods. On the left side of the dotted line are traditional methods, and on the right are methods based on deep learning. Milestone methods: RR [chen2012feature], CA-RR [chen2013cumulative], Count Forest [pham2015count], CNN-Boosting [walach2016learning], CP-CNN [sindagi2017generating], CMTL [sindagi2017cnn], Switching CNN [babu2017switching], CSRNet [li2018csrnet], CANNet [liu2019context], SA-Net [cao2018scale], ADSCNet [bai2020adaptive], SASNet [song2021choose], MLR [wu2006crowd], KRR [an2007face], Faster-OHEM-KCF [li2016deep], LC-FCN [laradji2018blobs], PSDNN [liu2019point], AM-CNN [shao2018crowdhuman], LSC-CNN [sam2020locate], P2PNet [song2021rethinking], Crowd-SDNet [wang2021self], DPDNet [lian2021locating].
Figure 3: For Mask Focal Loss, a target-based mask map (right) can be generated from the annotations (left) during training. The mask value inside a target bounding box is 1, and that in the background area is 0.
Figure 4: Different values of $\beta$ lead to different contributions to the total loss.
Figure 5: Given a scene (left), after manually setting several representative anchor boxes (middle), all pedestrian head sizes can be obtained automatically through a linear transformation (right).
...and 3 more figures
