A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting

Yashwardhan Chaudhuri, Ankit Kumar, Orchid Chetia Phukan, Arun Balaji Buduru

TL;DR

The paper tackles crowd counting on resource-constrained devices by introducing ASFNet, a lightweight architecture that uses two backbones, MobileNetV2 and MobileViT, with a common downstream network and Adjacent Semantics Fusion to generate density maps. It formalizes a feature fusion pipeline and trainable feature weighting, plus model compression via pruning and quantization to reduce parameters and FLOPs while maintaining accuracy. The training objective uses a pixel-wise $L_2$ loss $L(\Theta) = \frac{1}{2n} \sum_{i=1}^{n} \| h(\Theta, x_i) - G \|_2^2$, supporting quantitative learning of density maps. The authors report competitive MSE/MAE on ShanghaiTech A/B and UCF-CC-50 with significantly lower complexity, validated by ablations showing the benefits of adjacent feature fusion.
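The $L_2$ objective above can be illustrated with a small, dependency-free sketch. It assumes that $h(\Theta, x_i)$ is the predicted density map for image $x_i$ and that $G$ is the matching ground-truth map; the function names and the toy 2x2 maps are hypothetical and are not taken from the paper. The helper `crowd_count` relies on the standard convention that a density map sums to the number of people in the image.

```python
def l2_density_loss(predictions, targets):
    """Pixel-wise L2 loss over a batch: 1/(2n) * sum_i ||pred_i - gt_i||_2^2.

    `predictions` and `targets` are lists of 2D density maps (lists of rows).
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("need non-empty batches of equal length")
    total = 0.0
    for pred, gt in zip(predictions, targets):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total += (p - g) ** 2  # squared error at one pixel
    return total / (2 * len(predictions))


def crowd_count(density_map):
    """Estimated head count: the sum of all density values (standard convention)."""
    return sum(sum(row) for row in density_map)


# Toy batch with a single image (n = 1); values are illustrative only.
pred = [[[0.5, 0.0], [0.0, 0.5]]]
gt = [[[1.0, 0.0], [0.0, 0.0]]]

loss = l2_density_loss(pred, gt)
count = crowd_count(pred[0])
```

Two pixels differ by 0.5 each, so the summed squared error is 0.5 and the loss is 0.25; the predicted map still sums to one person, which shows that the pixel-wise loss penalises misplaced density even when the total count is right.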

Abstract

Crowd counting finds direct applications in real-world situations, making computational efficiency and performance crucial. However, most previous methods rely on a heavy backbone and a complex downstream architecture that restricts deployment. To address this challenge and enhance the versatility of crowd-counting models, we introduce two lightweight models. These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract features at diverse scales from a Pre-Trained Model (PTM) and subsequently combine these features. This approach allows our models to achieve improved performance while maintaining a compact and efficient design. Compared with previous state-of-the-art (SOTA) methods on the ShanghaiTech-A, ShanghaiTech-B, and UCF-CC-50 datasets, our models achieve comparable results while being the most computationally efficient. Finally, we present a comparative study and an extensive ablation study, along with pruning, to show the effectiveness of our models.

Paper Structure (11 sections, 7 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overall architecture of ASFNet-B and ASFNet-S. ASFNet-S uses a MobileViT backbone and ASFNet-B uses a MobileNet backbone; both share a common downstream network that enables multi-scale feature extraction.
  • Figure 2: Left: feature maps from each layer of ASFNet-B. Right: feature maps from each layer of ASFNet-S. F1, F2, F3, and F4 show the features at different scales and from intermediate layers.
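As a rough illustration of fusing features from adjacent scales such as F1 and F2, the sketch below upsamples the coarser map to the finer resolution and takes a weighted sum. This is only one plausible reading: the summary does not specify the paper's fusion operator, so the nearest-neighbour upsampling, the single-channel maps, the scalar weights (standing in for the trainable feature weighting), and all function names are assumptions made for illustration.

```python
def upsample_nearest(feature, factor):
    """Nearest-neighbour upsampling: repeat each value `factor` times per axis."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_adjacent(fine, coarse, w_fine=0.5, w_coarse=0.5):
    """Weighted sum of a fine map and its upsampled coarser neighbour.

    Hypothetical operator: in a trained model the two weights would be
    learned parameters rather than fixed constants.
    """
    factor = len(fine) // len(coarse)
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("coarse map does not upsample to the fine map's shape")
    return [
        [w_fine * f + w_coarse * c for f, c in zip(fine_row, up_row)]
        for fine_row, up_row in zip(fine, up)
    ]


# Toy single-channel maps: a 4x4 "F1" and a 2x2 "F2" (illustrative values).
f1 = [[1.0] * 4 for _ in range(4)]
f2 = [[2.0, 4.0], [6.0, 8.0]]

fused = fuse_adjacent(f1, f2)
```

With equal weights, each fused pixel is the average of the fine value and the coarse value covering it, so the top-left 2x2 block becomes 1.5 and the bottom-right block becomes 4.5. A real implementation would operate on multi-channel tensors and would typically follow the fusion with convolutional layers.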