Multimodal Crowd Counting with Pix2Pix GANs
Muhammad Asif Khan, Hamid Menouar, Ridha Hamila
TL;DR
The paper tackles the scarcity of paired multimodal data for crowd counting under poor illumination by using a Pix2Pix GAN to synthesize thermal infrared (TIR) imagery from RGB inputs. It introduces MMCount, a two-branch network that fuses RGB and TIR information to produce density maps, where the TIR input comes either from a real thermal sensor or from the GAN. Evaluations on DroneRGBT, ShanghaiTech Part-B, and CARPK show that adding synthetic TIR improves counting accuracy over RGB-only baselines, and that generated TIR can approach the performance of real TIR data. The approach makes multimodal crowd counting practical in low-light conditions when no thermal sensor is available, and points to future work on lightweight real-time GANs and broader cross-scene training.
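As background on the density maps mentioned above, the following is a minimal sketch of how a density map encodes a count. It is not the authors' code. It assumes the common convention of spreading one unit of mass around each annotated head with a normalized Gaussian kernel, so that the map sums to the number of people. The function names, kernel size, and sigma are illustrative assumptions.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [
        [math.exp(-((i - half) ** 2 + (j - half) ** 2) / (2 * sigma ** 2))
         for j in range(size)]
        for i in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, size=5, sigma=1.0):
    """Build a density map from (row, col) head annotations (hypothetical helper)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for r, c in points:
        # Keep only kernel cells that fall inside the image.
        cells = [
            (r + i - half, c + j - half, kernel[i][j])
            for i in range(size)
            for j in range(size)
            if 0 <= r + i - half < height and 0 <= c + j - half < width
        ]
        # Renormalize so a head near the border still contributes exactly 1.
        mass = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / mass
    return dmap


def count_from_density(dmap):
    """The estimated count is the sum over all pixels of the density map."""
    return sum(sum(row) for row in dmap)


# Three annotated heads, one of them in the image corner.
heads = [(2, 3), (10, 10), (0, 0)]
dmap = density_map(heads, height=16, width=16)
estimated = count_from_density(dmap)
```

A counting network is trained to regress such a map from its input images. At inference time the predicted map is summed to obtain the count.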
Abstract
Most state-of-the-art crowd counting methods use color (RGB) images to learn the density map of the crowd. However, these methods often struggle to achieve high accuracy in densely crowded scenes with poor illumination. Recently, some studies have reported improved accuracy for crowd counting models that combine RGB and thermal images. Although multimodal data can lead to better predictions, such data may not always be available beforehand. In this paper, we propose using generative adversarial networks (GANs) to automatically generate thermal infrared (TIR) images from color (RGB) images, and using both to train crowd counting models to achieve higher accuracy. We first use a Pix2Pix GAN to translate RGB images to TIR images. Our experiments with several state-of-the-art crowd counting models on benchmark crowd datasets show a significant improvement in accuracy.
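For reference, the RGB-to-TIR translation step can be read against the standard Pix2Pix objective from the original formulation by Isola et al. The abstract does not say whether the authors modify this loss or which weight they use. Mapping $x$ to the RGB image and $y$ to the paired real TIR image is an assumption made here for illustration, as is the weight $\lambda$.

```latex
% Conditional adversarial loss: D sees the input x paired with a real or generated target.
\mathcal{L}_{\mathrm{cGAN}}(G, D) =
  \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

% L1 reconstruction loss: keeps the generated image close to the paired ground truth.
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]

% Full objective: G minimizes, D maximizes, with \lambda weighting the L1 term.
G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
```

Here $G$ is the generator, $D$ is the discriminator, and $z$ is the noise source. The L1 term requires pixel-aligned RGB and TIR pairs for training, so the generator must be trained on a dataset that contains both modalities before it can synthesize TIR for RGB-only data.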
