MCNet: A crowd density estimation network based on integrating multiscale attention module

Qiang Guo, Rubo Zhang, Di Zhao

TL;DR

MCNet tackles metro crowd-density estimation by combining a lightweight texture-feature extractor with an Integrating Multiscale Attention (IMA) module to capture wide, multi-scale crowd activations. The IMA module fuses dilation convolutions, multi-branch features, and an attention gate to strengthen crowd texture activations, which are then fed into a compact classifier to predict three density levels. Across CIFAR-10, PETS2009, Mall, QUT, and SH_METRO, MCNet achieves competitive accuracy with a very small parameter count and fast inference, and remains feasible on embedded RK3399 hardware, albeit with some power-cost trade-offs when the IMA module is used. These results indicate practical viability for real-time metro surveillance and potential applicability to other embedded-vision tasks requiring efficient, multi-scale texture modeling.

Abstract

Metro video surveillance systems have not yet been able to solve the metro crowd density estimation problem effectively. To address this, a Metro Crowd density estimation Network (MCNet) is proposed to automatically classify the crowd density level of passengers. First, an Integrating Multi-scale Attention (IMA) module is proposed to enhance the ability of plain classifiers to extract semantic crowd texture features, adapting them to the characteristics of crowd texture. The innovation of the IMA module is to fuse dilation convolution, multi-scale feature extraction, and an attention mechanism, so as to obtain multi-scale crowd feature activations from a larger receptive field at lower computational cost and to strengthen the crowd activation state of the convolutional features in the top layers. Second, a novel lightweight crowd texture feature extraction network is proposed, which directly processes video frames and automatically extracts texture features for crowd density estimation; its fast image processing speed and small number of parameters make it flexible to deploy on embedded platforms with limited hardware resources. Finally, this paper integrates the IMA module and the lightweight crowd texture feature extraction network to construct MCNet, and validates the network on one image classification dataset (Cifar10) and four crowd density datasets (PETS2009, Mall, QUT, and SH_METRO) to assess whether MCNet is a suitable solution for crowd density estimation in metro video surveillance, where image processing challenges include high density, heavy occlusion, perspective distortion, and limited hardware resources.
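
The abstract's claim of a larger receptive field at lower computational cost rests on a standard property of dilation convolution: a filter of size $k$ with dilation rate $d$ spans $k + (k-1)(d-1)$ positions along each axis while keeping the same number of weights. The short sketch below works through that arithmetic for a $3\times3$ filter. The dilation rates 1, 2, and 3 and the function names are illustrative assumptions, not values or identifiers taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated filter along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Weights per input/output channel pair; this does not depend on dilation."""
    return kernel_size * kernel_size


# Assumed dilation rates for illustration only (not the paper's settings).
for d in (1, 2, 3):
    k_eff = effective_kernel_size(3, d)
    print(f"d={d}: covers {k_eff}x{k_eff} positions, weights={weights_per_filter(3)}")
```

Under these assumed rates, the area covered grows with $d$ while the weight count per filter stays at 9, which is the trade-off the IMA module is described as exploiting.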

Paper Structure (16 sections, 10 equations, 8 figures, 8 tables)

This paper contains 16 sections, 10 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The image processing pipeline of the feature-based crowd density estimation method.
  • Figure 2: The architecture of the lightweight crowd texture feature extraction network, where "CONV" represents a convolutional layer and "filter" denotes a convolutional filter with shape output channels $\times$ input channels $\times$ filter height $\times$ filter width. "F" represents the filter shape of the fire module. "Pooling" represents a pooling layer. The ReLU layers after the convolutional layers are omitted for clarity.
  • Figure 3: The structure of the IMA module, where "DCONV1-3" and "CONV1-3" represent the dilation convolutional layers and convolutional layers of the different branches. "$3\times3$" and "$1\times1$" denote their filter shapes. "d" represents the dilation rate. "Att_softmax1-3" denote the softmax layers of the different branches, which implement the gate mechanism.
  • Figure 4: Illustration of the mechanism of the IMA module. The red rectangle in subfigure (a) represents a convolutional filter of size "$3\times3$". In subfigure (b), the dotted lines of different colors denote the receptive fields of the different dilation convolutional layers. The red rectangles with black shading in subfigure (c) indicate that the attention mechanism has been incorporated to strengthen crowd features.
  • Figure 5: The architecture of the MCNet.
  • ...and 3 more figures
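
The Figure 3 caption mentions softmax layers that realize a gate mechanism in each branch. As a rough illustration of what a softmax gate does in general, the sketch below normalizes a list of scores into weights that sum to 1 and uses them to rescale a list of feature values. How the paper computes the scores, over which dimension the softmax is taken, and how the branches are fused are not stated in this summary, so the flat-list layout and the names `softmax` and `softmax_gate` are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gate(features, scores):
    """Rescale each feature by its softmax weight (hypothetical gate layout)."""
    weights = softmax(scores)
    return [f * w for f, w in zip(features, weights)]


# Toy values: the position with the highest score keeps the largest share.
features = [1.0, 2.0, 4.0]
scores = [0.1, 0.2, 2.0]
print(softmax_gate(features, scores))
```

Positions with higher scores retain more of their activation, which matches the caption's description of attention strengthening crowd features, though the actual IMA gate may differ in its details.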