Mask Focal Loss: A unifying framework for dense crowd counting with canonical object detection networks
Xiaopin Zhong, Guankun Wang, Weixiang Liu, Zongze Wu, Yuanlong Deng
TL;DR
This work tackles crowd counting by reframing detection losses to handle dense, cluttered scenes and center-point annotations through Mask Focal Loss (MFL). By leveraging Gaussian heatmaps and an area-based positive mask, MFL unifies loss calculations for heatmap and binary feature map detectors, addressing sample imbalance and spatial coherence issues. The authors introduce GTA_Head, a large synthetic dataset with bounding-box annotations to benchmark detection-based counting, and demonstrate that MFL consistently improves MAE and RMSE across detectors and datasets, with anchor-free models showing the strongest gains. These findings suggest that density estimation-based approaches can be outperformed by well-calibrated detection pipelines when trained with MFL, offering a practical foundation for real-time, scalable crowd counting in complex scenes.
Abstract
As a fundamental computer vision task, crowd counting plays an important role in public safety. Deep-learning-based head detection is currently a promising approach to crowd counting. However, widely studied object detection networks are poorly suited to this problem for three reasons: (1) existing loss functions fail to address sample imbalance in highly dense and complex scenes; (2) canonical object detectors lack spatial coherence in loss calculation, disregarding the relationship between object location and background region; (3) most head detection datasets are annotated only with center points, i.e., without bounding boxes. To overcome these issues, we propose a novel Mask Focal Loss (MFL) based on heatmaps generated with a Gaussian kernel. MFL provides a unifying framework for loss functions based on both heatmap and binary feature map ground truths. Additionally, we introduce GTA_Head, a synthetic dataset with comprehensive annotations, for evaluation and comparison. Extensive experimental results demonstrate the superior performance of MFL across various detectors and datasets: it reduces MAE and RMSE by up to 47.03% and 61.99%, respectively. Therefore, our work presents a strong foundation for advancing crowd counting methods based on density estimation.
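
The abstract reports gains as relative reductions in MAE and RMSE. As a minimal sketch of how such figures are typically computed in crowd counting, the snippet below evaluates both metrics over per-image counts and the percentage reduction between two models. The function names and the toy counts are hypothetical and chosen only for illustration; they are not taken from the paper or its experiments.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true per-image counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)

def rmse(pred_counts, true_counts):
    """Root of the mean squared error between predicted and true counts."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers an error metric versus `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Toy per-image head counts (illustrative values, not from the paper).
true_counts = [120, 45, 300]
baseline_pred = [100, 50, 260]   # hypothetical detector with its original loss
improved_pred = [115, 46, 290]   # hypothetical detector after a loss change

mae_drop = relative_reduction(mae(baseline_pred, true_counts),
                              mae(improved_pred, true_counts))
rmse_drop = relative_reduction(rmse(baseline_pred, true_counts),
                               rmse(improved_pred, true_counts))
print(f"MAE reduced by {mae_drop:.2f}%, RMSE reduced by {rmse_drop:.2f}%")
```

Because RMSE squares each error before averaging, it penalizes large miscounts on individual images more heavily than MAE, which is one reason the two metrics can improve by different percentages.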
