FGA: Fourier-Guided Attention Network for Crowd Count Estimation

Yashwardhan Chaudhuri, Ankit Kumar, Arun Balaji Buduru, Adel Alshamrani

TL;DR

This work addresses the challenge of capturing full-scale global patterns in crowd counting by introducing Fourier-Guided Attention (FGA), a dual-path module that combines Fast Fourier Convolution in the frequency domain with traditional spatial convolutions and attention mechanisms. By splitting input features into global and local streams and applying spectral processing alongside spatial/channel attention, FGA delivers improved density-map regression when plugged into CSRNet and CANNet across ShanghaiTech, UCF-CC-50, and JHU++ datasets, as evidenced by reduced MSE and MAE and supported by Grad-CAM interpretability. Key contributions include the dual-path architectural design, a detailed spectral block implementation, and comprehensive ablations demonstrating the necessity of combining FFT with attention. The results suggest that FGA provides a practical and scalable pathway to better crowd counting in diverse scenes, with potential for broader adoption as a plug-in module in existing CNN-based estimators.

Abstract

Crowd counting is gaining societal relevance, particularly in urban planning, crowd management, and public safety. This paper introduces Fourier-guided attention (FGA), a novel attention mechanism for crowd count estimation designed to address the inefficient capture of full-scale global patterns in existing convolution-based attention networks. FGA captures multi-scale information, including full-scale global patterns, by using the Fast Fourier Transform (FFT) together with spatial attention for global features, and convolutions with channel-wise attention for semi-global and local features. The architecture is dual-path: (1) one path processes full-scale global features through the FFT, allowing efficient extraction of information in the frequency domain, and (2) the other path processes the remaining feature maps for semi-global and local features using traditional convolutions and channel-wise attention. This design lets FGA integrate frequency and spatial information, enhancing its ability to capture diverse crowd patterns. We apply FGA in the last layers of two popular crowd-counting networks, CSRNet and CANNet, and evaluate the module on the benchmark datasets ShanghaiTech-A, ShanghaiTech-B, UCF-CC-50, and JHU++ crowd. The experiments show a notable improvement on all datasets in mean squared error (MSE) and mean absolute error (MAE), with performance comparable to recent state-of-the-art methods. Additionally, a qualitative analysis using Grad-CAM heatmaps illustrates the interpretability of FGA and its effectiveness in capturing crowd patterns.
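
As a reference for the two metrics named in the abstract, the following minimal sketch computes MAE and MSE over per-image crowd counts. It assumes the convention common in crowd-counting work, where the reported "MSE" is the square root of the mean squared count error; this text does not state which definition the paper uses, so the `take_root` flag (a hypothetical name) allows either form. The counts in the usage lines are made-up values for illustration only, not results from the paper.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    if len(pred_counts) != len(true_counts) or not true_counts:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - t) for p, t in zip(pred_counts, true_counts))
    return total / len(true_counts)


def mse(pred_counts, true_counts, take_root=True):
    """Mean squared count error.

    take_root=True returns the square root of the mean squared error
    (an assumed convention; see the note above).
    """
    if len(pred_counts) != len(true_counts) or not true_counts:
        raise ValueError("need two non-empty sequences of equal length")
    mean_sq = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
    mean_sq /= len(true_counts)
    return math.sqrt(mean_sq) if take_root else mean_sq


# Illustrative (made-up) per-image counts.
pred = [102.0, 48.0, 310.0]
true = [100.0, 50.0, 300.0]
print(mae(pred, true))                    # (2 + 2 + 10) / 3
print(mse(pred, true))                    # sqrt((4 + 4 + 100) / 3) = 6.0
print(mse(pred, true, take_root=False))   # 36.0
```

Because the squared term weights large errors more heavily, a single badly miscounted image moves MSE far more than MAE, which is why the two are usually reported together.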

Paper Structure (17 sections, 14 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Left: Image of a crowd scene given as input to the neural network. Right: Density map of the crowd scene. The density map is brightest in the top-right corner, where crowd density is highest, and fades towards the left, where crowd density is lower.
  • Figure 2: FGA module: The module has two feature-extraction sections. Left: The local feature extraction takes a fraction of the input feature maps for local processing. Right: The global feature extraction takes another fraction of the input feature maps for global processing. The spectral block captures full-scale global features. ##: Refer to Figure 3 for more details. #: Refer to Figure 4 for more details on the attention blocks.
  • Figure 3: Spectral block: How the spectral block operates. Conv-BN-ReLU: a convolution, batch normalization, and ReLU combination. Real-FFT2d: the 2D fast Fourier transform of real-valued features. Inv-FFT2d: the inverse transform from the Fourier domain back to the real domain. Conv 1x1: a convolution with a 1x1 kernel.
  • Figure 4: Attention blocks: The spatial attention block used in the global feature extraction of Figure 2, and the channel attention block used in the local feature extraction. R: Resize. T: Transpose.
  • Figure 5: Counting samples from varying crowd distributions in ShanghaiTech-B combination: Each output density map is shown directly beside its input image, for two different models equipped with the FGA module.
  • ...and 1 more figures
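
The operations named in the Figure 3 caption (Real-FFT2d, Conv 1x1, ReLU, Inv-FFT2d) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the caption lists the operations but not their exact wiring, so the ordering here follows the general Fourier-unit pattern (transform, mix channels pointwise in the frequency domain, transform back) as an assumption. Batch normalization and the surrounding spatial Conv-BN-ReLU layers are omitted, the weights are random placeholders, and the names `conv1x1` and `spectral_transform` are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: mixes channels independently at each position.

    x: (C_in, H, W) array, weight: (C_out, C_in) array.
    """
    return np.einsum("oc,chw->ohw", weight, x)


def spectral_transform(x, weight, apply_relu=True):
    """Frequency-domain channel mixing for real features x of shape (C, H, W).

    weight has shape (2C, 2C) because the real and imaginary parts of the
    spectrum are stacked along the channel axis (an assumed layout).
    """
    c, h, w = x.shape
    # Real 2D FFT over the spatial axes: result is (C, H, W // 2 + 1), complex.
    freq = np.fft.rfft2(x, axes=(-2, -1))
    # Stack real and imaginary parts so a real-valued 1x1 conv can mix them.
    stacked = np.concatenate([freq.real, freq.imag], axis=0)
    mixed = conv1x1(stacked, weight)
    if apply_relu:
        mixed = np.maximum(mixed, 0.0)
    # Rebuild the complex spectrum and return to the spatial domain.
    spectrum = mixed[:c] + 1j * mixed[c:]
    return np.fft.irfft2(spectrum, s=(h, w), axes=(-2, -1))


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))        # placeholder feature maps
w = rng.standard_normal((8, 8)) * 0.1     # placeholder 1x1 conv weights
y = spectral_transform(x, w)
print(y.shape)  # (4, 8, 8)
```

Each frequency coefficient depends on every spatial position, so a pointwise operation in the frequency domain affects the whole feature map at once. That is the sense in which a spectral block can capture full-scale global features with a single layer.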