FGA: Fourier-Guided Attention Network for Crowd Count Estimation
Yashwardhan Chaudhuri, Ankit Kumar, Arun Balaji Buduru, Adel Alshamrani
TL;DR
This work addresses the challenge of capturing full-scale global patterns in crowd counting by introducing Fourier-Guided Attention (FGA), a dual-path module that combines Fast Fourier Convolution in the frequency domain with traditional spatial convolutions and attention mechanisms. By splitting input features into global and local streams and applying spectral processing alongside spatial/channel attention, FGA delivers improved density-map regression when plugged into CSRNet and CANNet across ShanghaiTech, UCF-CC-50, and JHU++ datasets, as evidenced by reduced MSE and MAE and supported by Grad-CAM interpretability. Key contributions include the dual-path architectural design, a detailed spectral block implementation, and comprehensive ablations demonstrating the necessity of combining FFT with attention. The results suggest that FGA provides a practical and scalable pathway to better crowd counting in diverse scenes, with potential for broader adoption as a plug-in module in existing CNN-based estimators.
Abstract
Crowd counting is gaining societal relevance, particularly in domains such as urban planning, crowd management, and public safety. This paper introduces Fourier-guided attention (FGA), a novel attention mechanism for crowd count estimation designed to address the inefficient capture of full-scale global patterns in existing convolution-based attention networks. FGA efficiently captures multi-scale information, including full-scale global patterns, by using Fast Fourier Transforms (FFT) together with spatial attention for global features, and convolutions with channel-wise attention for semi-global and local features. The architecture of FGA follows a dual-path approach: (1) a path that processes full-scale global features through FFT, allowing efficient extraction of information in the frequency domain, and (2) a path that processes the remaining feature maps for semi-global and local features using traditional convolutions and channel-wise attention. This dual-path architecture lets FGA integrate frequency and spatial information, enhancing its ability to capture diverse crowd patterns. We apply FGA in the last layers of two popular crowd-counting networks, CSRNet and CANNet, and evaluate the module on the benchmark datasets ShanghaiTech-A, ShanghaiTech-B, UCF-CC-50, and JHU++ crowd. The experiments show a notable improvement over both baselines on all datasets in terms of mean squared error (MSE) and mean absolute error (MAE), with performance comparable to recent state-of-the-art methods. Additionally, we illustrate interpretability through a qualitative analysis with Grad-CAM heatmaps, showing the effectiveness of FGA in capturing crowd patterns.
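The dual-path idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`spectral_path`, `local_path`, `fga_sketch`), the channel split point, the complex channel-mixing weights, the 3x3 box filter standing in for learned convolutions, and the parameter-free sigmoid attention gates are all assumptions chosen to keep the example short and runnable.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spectral_path(x, w_real, w_imag):
    """Global path (assumed form): FFT -> channel mixing per frequency -> inverse FFT.

    x: real feature map of shape (C, H, W); w_real, w_imag: (C_out, C_in).
    """
    freq = np.fft.rfft2(x, axes=(-2, -1))            # complex, (C, H, W//2 + 1)
    w = w_real + 1j * w_imag                         # hypothetical learned weights
    mixed = np.einsum("oc,chw->ohw", w, freq)        # 1x1 mixing in the frequency domain
    return np.fft.irfft2(mixed, s=x.shape[-2:], axes=(-2, -1))


def local_path(x):
    """Local path stand-in: a 3x3 box filter with edge padding (not a learned conv)."""
    _, h, w = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return sum(p[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def fga_sketch(x, split, w_real, w_imag):
    """Split channels into global/local streams, process each, then concatenate."""
    g, l = x[:split], x[split:]
    g = spectral_path(g, w_real, w_imag)
    g = g * sigmoid(g.mean(axis=0, keepdims=True))       # spatial gate, shape (1, H, W)
    l = local_path(l)
    l = l * sigmoid(l.mean(axis=(1, 2), keepdims=True))  # channel gate, shape (C_l, 1, 1)
    return np.concatenate([g, l], axis=0)


# Usage with random features; identity mixing weights are an arbitrary choice.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
out = fga_sketch(x, split=4, w_real=np.eye(4), w_imag=np.zeros((4, 4)))
print(out.shape)  # (8, 16, 16)
```

Because every frequency bin depends on all spatial positions, a single pointwise operation in the frequency domain has an image-wide receptive field, which is the motivation the abstract gives for routing global features through the FFT path.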
