MMDEW: Multipurpose Multiclass Density Estimation in the Wild
Villanelle O'Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown
TL;DR
MMDEW tackles multiclass crowd counting in dense, occluded scenes by combining a Twins-SVT backbone with a multiscale decoder and a dedicated Category Focus Module. A regional loss suppresses inter-class cross-talk and enables cross-domain generalization. On the VisDrone and iSAID benchmarks the method reduces MAE by 33%, 43% and 64% relative to prior multiclass methods, and it is also applied to a biodiversity monitoring dataset. A lightweight segmentation task guides feature learning during training without requiring masks at inference, and ground-truth density maps are generated from centroid-based annotations so that object counts are preserved. Together, these elements yield state-of-the-art or competitive results across multiple benchmarks and domain settings, with practical relevance for ecological monitoring and large-scale counting tasks.
Abstract
Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, which suppresses inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reductions in MAE), and the comparison with YOLOv11 underscores the necessity of crowd-counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, as demonstrated by its application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
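To make the counting-by-density idea concrete, the following is a minimal sketch of how per-class density maps might be built from centroid annotations and how MAE is computed from them. It is an illustration under stated assumptions, not the authors' implementation: the fixed Gaussian kernel, the `sigma` value, the `(row, col, class_id)` annotation layout, and the function names `density_map`, `multiclass_density` and `mae` are all hypothetical choices made for this example.

```python
import numpy as np


def density_map(points, height, width, sigma=4.0):
    """Build a density map for ONE class from (row, col) centroids.

    Assumption: a fixed-width Gaussian per object; centroids lie inside
    the image. Each kernel is renormalised over the image so that every
    object contributes exactly 1 to the sum, even near the borders.
    """
    ys = np.arange(height, dtype=np.float64)[:, None]
    xs = np.arange(width, dtype=np.float64)[None, :]
    dmap = np.zeros((height, width), dtype=np.float64)
    for r, c in points:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()
    return dmap


def multiclass_density(annotations, num_classes, height, width, sigma=4.0):
    """Stack one density map per class.

    annotations: iterable of (row, col, class_id) tuples (hypothetical layout).
    Returns an array of shape (num_classes, height, width).
    """
    maps = np.zeros((num_classes, height, width), dtype=np.float64)
    for cls in range(num_classes):
        pts = [(r, c) for (r, c, k) in annotations if k == cls]
        maps[cls] = density_map(pts, height, width, sigma)
    return maps


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    return float(np.mean(np.abs(pred - true)))


# Two objects of class 0 and one of class 1 in a 32x48 image.
annotations = [(10, 10, 0), (20, 30, 0), (5, 5, 1)]
maps = multiclass_density(annotations, num_classes=2, height=32, width=48)
counts = maps.sum(axis=(1, 2))  # per-class count = integral of the density map
```

Summing each channel recovers the per-class object count, which is the property that makes density maps usable where individual detections overlap too heavily to be separated.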
