A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting
Yashwardhan Chaudhuri, Ankit Kumar, Orchid Chetia Phukan, Arun Balaji Buduru
TL;DR
The paper tackles crowd counting on resource-constrained devices by introducing ASFNet, a lightweight architecture that uses two backbones, MobileNetV2 and MobileViT, with a common downstream network and Adjacent Semantics Fusion to generate density maps. It formalizes a feature fusion pipeline and trainable feature weighting, plus model compression via pruning and quantization to reduce parameters and FLOPs while maintaining accuracy. The training objective uses a pixel-wise $L_2$ loss $L(\Theta) = \frac{1}{2n} \sum_{i=1}^{n} \| h(\Theta, x_i) - G \|_2^2$, supporting quantitative learning of density maps. The authors report competitive MSE/MAE on ShanghaiTech A/B and UCF-CC-50 with significantly lower complexity, validated by ablations showing the benefits of adjacent feature fusion.
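As a minimal sketch of the pixel-wise $L_2$ objective quoted in the TL;DR, the NumPy snippet below computes $\frac{1}{2n} \sum_i \| h(\Theta, x_i) - G \|_2^2$ for a batch of predicted density maps. The function name `pixelwise_l2_loss` and the array layout `(n, H, W)` are illustrative assumptions, and the code assumes each image has its own ground-truth density map (the formula writes $G$ without an index).

```python
import numpy as np

def pixelwise_l2_loss(pred, gt):
    """Pixel-wise L2 loss over a batch of density maps.

    pred, gt: arrays of shape (n, H, W); hypothetical layout chosen for
    illustration. pred[i] plays the role of h(Theta, x_i) and gt[i] the
    role of the ground-truth density map for image i.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    n = pred.shape[0]
    # Squared L2 norm of the per-image error, summed over all pixels.
    per_image = ((pred - gt) ** 2).reshape(n, -1).sum(axis=1)
    # Average over the batch with the 1/(2n) factor from the formula.
    return per_image.sum() / (2.0 * n)

# Two 2x2 maps that are off by 1 at every pixel: each image contributes 4.
loss = pixelwise_l2_loss(np.zeros((2, 2, 2)), np.ones((2, 2, 2)))
```

In a density-map setting, the predicted count for an image is typically obtained by summing its density map, so this loss penalizes per-pixel errors rather than the count error directly.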
Abstract
Crowd counting finds direct applications in real-world situations, making computational efficiency and performance crucial. However, most previous methods rely on a heavy backbone and a complex downstream architecture, which restricts deployment. To address this challenge and enhance the versatility of crowd-counting models, we introduce two lightweight models. These models share the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract features at diverse scales from a Pre-Trained Model (PTM) and subsequently combine them. This approach enables our models to achieve improved performance while maintaining a compact and efficient design. Compared with previous state-of-the-art (SOTA) methods on the ShanghaiTech-A, ShanghaiTech-B, and UCF-CC-50 datasets, our models achieve comparable results while being the most computationally efficient. Finally, we present a comparative study, an extensive ablation study, and pruning experiments to demonstrate the effectiveness of our models.
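The idea of combining features from adjacent scales can be illustrated with a small NumPy sketch. This is not the authors' fusion module: it assumes two feature maps with the same channel count whose spatial sizes differ by an integer factor, upsamples the deeper one by nearest-neighbor repetition, and mixes the two with scalar weights that stand in for trainable parameters. The names `upsample_nearest`, `fuse_adjacent`, `w_shallow`, and `w_deep` are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

def fuse_adjacent(shallow, deep, w_shallow=0.5, w_deep=0.5):
    """Weighted sum of two adjacent-scale feature maps.

    shallow: (C, H, W) map from an earlier backbone stage.
    deep:    (C, H // k, W // k) map from the next stage (assumed layout).
    w_shallow, w_deep: scalars standing in for learned fusion weights.
    """
    if shallow.shape[0] != deep.shape[0]:
        raise ValueError("channel counts must match in this sketch")
    factor = shallow.shape[-1] // deep.shape[-1]
    up = upsample_nearest(deep, factor)
    if up.shape != shallow.shape:
        raise ValueError("spatial sizes must differ by an integer factor")
    return w_shallow * shallow + w_deep * up

shallow = np.ones((1, 4, 4))
deep = np.full((1, 2, 2), 3.0)
fused = fuse_adjacent(shallow, deep)  # shape (1, 4, 4)
```

In a real network the channel counts of adjacent stages usually differ, so a projection layer (for example a 1x1 convolution) would be needed before the sum; the sketch sidesteps this by assuming equal channels.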
