MMDEW: Multipurpose Multiclass Density Estimation in the Wild

Villanelle O'Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown

TL;DR

MMDEW tackles multiclass crowd counting in challenging, dense, and occluded scenes by integrating a Twins-SVT backbone with a multiscale decoder and a dedicated Category Focus Module. A regional loss suppresses inter-class cross-talk and enables cross-domain generalization, demonstrated on VisDrone, iSAID, and a biodiversity dataset, with substantial reductions in MAE compared to prior multiclass methods. The approach also employs a lightweight segmentation task during training to guide feature learning without relying on masks at inference, and ground-truth density maps are generated from centroid-based annotations to preserve counts. Together, these elements yield state-of-the-art or competitive results across multiple benchmarks and domain settings, highlighting practical impact for ecological monitoring and large-scale counting tasks.

Abstract

Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reductions in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, demonstrated through its application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
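
To make the density-map idea concrete, the following is a minimal sketch of how a per-class count can be read off a density map built from centroid annotations, and how MAE is computed over counts. It is not the authors' implementation: the function names, the Gaussian kernel, and the `sigma` and `radius` values are all assumptions chosen for illustration, and centroids are assumed to be integer pixel coordinates inside the image.

```python
import math

def density_map(centroids, height, width, sigma=1.5, radius=4):
    """Build a height x width density map whose sum equals len(centroids).

    Hypothetical helper: sigma and radius are illustrative, not from the paper.
    """
    grid = [[0.0] * width for _ in range(height)]
    for cx, cy in centroids:
        # Gaussian weights for the pixels around the centroid that lie inside the image.
        weights = {}
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                weights[(y, x)] = math.exp(-dist_sq / (2 * sigma ** 2))
        total = sum(weights.values())
        # Renormalise so each object contributes exactly one unit of mass,
        # even when its kernel is clipped by the image border.
        for (y, x), w in weights.items():
            grid[y][x] += w / total
    return grid

def count(dmap):
    """The estimated object count is the sum over the density map."""
    return sum(sum(row) for row in dmap)

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

# Toy multi-class example: one density map per class, as (x, y) centroids.
annotations = {"car": [(3, 4), (0, 0)], "person": [(10, 7)]}
maps = {cls: density_map(pts, height=12, width=16) for cls, pts in annotations.items()}
counts = {cls: count(m) for cls, m in maps.items()}
```

In this sketch each class is treated as its own counting task with its own map, which mirrors the multi-class framing of the abstract; the border renormalisation is one simple way to keep the map's sum equal to the annotated count.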

Paper Structure

This paper contains 22 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Multipurpose Multi-class Density Estimation. Testing results from our multicategory crowd counting method applied to the Hicks flower (hicks2021deep), VisDrone-DET (zhu2021detectionVisDrone) and iSAID (waqas2019iSAID) datasets.
  • Figure 2: Class Distribution. In multi-class density estimation, each class is a distinct counting task, so an imbalance in how often classes appear can strongly influence gradients when most of the counting tasks have an optimal output of zero for a given sample. The plot shows the number of distinct classes present in each image. VisDrone-DET (both the 8- and 10-class versions) and iSAID images typically contain many, sometimes all, of their classes, whereas samples from the Hicks flower dataset usually feature at most one class. The iSAID distribution is from our patched 4-category subset of iSAID, following michel2022class; because of the patching, 42% of samples contain no annotations.
  • Figure 3: Our Model Architecture. Within the Multiscale Aware Module (MAM), a concept from yu2025multiscale that is used differently here, the first column of convolutions is followed immediately by a column of batch norm and ReLU activations. The Category Focus Module (CFM) is an extension of the MAM with one additional $Conv \to Conv_{dilated}$ row with a dilation of 4.
  • Figure 4: Model Heads. The two output heads of the model.
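
The statistic plotted in Figure 2 (how many distinct classes appear in each image, and what share of images are empty) can be computed with a few lines. This is a minimal sketch under stated assumptions: the function name and the toy annotations are hypothetical, and each image is represented simply as a list of object labels.

```python
from collections import Counter

def classes_per_image(annotations):
    """Map: number of distinct classes in an image -> number of such images."""
    return Counter(len(set(labels)) for labels in annotations)

# Hypothetical toy annotations: one list of object labels per image.
toy = [
    ["car", "car", "person"],   # 2 distinct classes
    ["car"],                    # 1 distinct class
    [],                         # empty image (e.g. a patch without annotations)
    ["person", "bus", "car"],   # 3 distinct classes
    [],                         # empty image
]

hist = classes_per_image(toy)
empty_fraction = hist[0] / len(toy)  # share of images with no annotation
```

A histogram skewed towards zero or one class per image indicates that most per-class counting tasks have a target of zero for a typical sample, which is the imbalance the caption describes.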