
FGENet: Fine-Grained Extraction Network for Congested Crowd Counting

Hao-Yuan Ma, Li Zhang, Xiang-Yi Wei

TL;DR

FGENet tackles precise crowd counting under annotation noise by shifting from density maps to a point-based framework. It introduces a FasterNet-L–based backbone with a Fine-Grained Feature Pyramid (FGFP) neck and a Three-Task Combination (TTC) loss, guided by Hungarian matching to align predicted and ground-truth points. The approach achieves state-of-the-art performance on challenging datasets such as ShanghaiTech Part A and UCF_CC_50, with ablations confirming the effectiveness of FGFP and TTC in preserving fine-grained information and mitigating label noise. This method offers robust counting in high-density scenes and has practical implications for real-world crowd analysis, though it incurs computational cost due to the matching process.

Abstract

Crowd counting has gained significant popularity due to its practical applications. However, mainstream counting methods ignore precise individual localization and suffer from annotation noise because they derive counts from estimated density maps. They also struggle with high-density images. To address these issues, we propose an end-to-end model called Fine-Grained Extraction Network (FGENet). Unlike methods that estimate density maps, FGENet directly learns the original coordinate points that represent the precise localization of individuals. This study designs a fusion module, named Fine-Grained Feature Pyramid (FGFP), that fuses the feature maps extracted by the backbone of FGENet. The fused features are then passed to both regression and classification heads: the former predicts point coordinates for a given image, and the latter assigns each predicted point a confidence of being an individual. Finally, FGENet establishes correspondences between predicted points and ground-truth points using the Hungarian algorithm. For training FGENet, we design a robust loss function, named Three-Task Combination (TTC), to mitigate the impact of annotation noise. Extensive experiments are conducted on four widely used crowd counting datasets, and the results demonstrate the effectiveness of FGENet. Notably, our method improves Mean Absolute Error (MAE) by 3.14 points on the ShanghaiTech Part A dataset over the existing state-of-the-art methods, and by 30.16 points on the UCF_CC_50 dataset.
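The matching step described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the cost used here (Euclidean distance minus a weighted confidence score), the `conf_weight` parameter, and the helper names `hungarian` and `match_points` are assumptions made for illustration only. The sketch assumes there are at least as many predicted points as ground-truth points, so every ground-truth point receives exactly one prediction.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for a matrix with rows <= columns.

    Returns a sorted list of (row, col) pairs, one pair per row.
    Standard O(n^2 * m) potentials-based Hungarian algorithm.
    """
    if not cost:
        return []
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def match_points(gt_points, pred_points, pred_scores, conf_weight=1.0):
    """Match each ground-truth point to one predicted point.

    Assumed cost: distance(gt, pred) - conf_weight * confidence(pred).
    """
    cost = [
        [math.dist(g, q) - conf_weight * s
         for q, s in zip(pred_points, pred_scores)]
        for g in gt_points
    ]
    return hungarian(cost)


gt = [(10.0, 10.0), (50.0, 50.0)]
preds = [(52.0, 49.0), (11.0, 9.0), (200.0, 200.0)]
scores = [0.9, 0.8, 0.1]
matches = match_points(gt, preds, scores)
print(matches)  # [(0, 1), (1, 0)]
```

Each pair is (ground-truth index, prediction index). The third prediction is left unmatched; in a point-based framework such unmatched predictions would typically be treated as negatives by the classification head.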

Paper Structure

This paper contains 17 sections, 6 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: An example from the ShanghaiTech Part A dataset: (a) various kinds of noise caused by data annotation ((1) label noise, (2) missing annotation, and (3) the overlapping effect caused by Gaussian kernels), (b) the ground truth under the point framework, (c) the prediction map generated by FGENet, and (d) the ground truth under the density-map framework.
  • Figure 2: Structure of the FGENet.
  • Figure 3: Three crowd images from ShanghaiTech Part A (SHT_A) and the corresponding predictions obtained by FGENet.