Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark

Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Qinghua Hu, Pengfei Zhu

TL;DR

This work tackles the dynamic fore-background imbalance in video object counting by presenting E-MAC, a density-embedded masked autoencoder framework that combines DEMO (density-guided multi-modal self-representation), SAM (density-driven adaptive masking), and TCF (optical-flow–assisted temporal fusion). By treating density maps as an auxiliary modality and leveraging temporal context, E-MAC achieves improved per-frame density regression across multiple datasets, including a new large-scale DroneBird bird-counting benchmark. Key contributions include the DEMO module for cross-modal self-supervision, the SAM mechanism to focus learning on foreground targets, the temporal fusion strategy to exploit inter-frame dynamics, and the DroneBird dataset for natural, drone-view counting. Experimental results show state-of-the-art performance on FDST, Mall, VSCrowd, and DroneBird, with notable gains from ablations and loss-function analyses, validating the approach and its applicability to natural scenes and small-target counting tasks.

Abstract

The dynamic imbalance between foreground and background is a major challenge in video object counting and is usually caused by the sparsity of target objects. It remains understudied in existing work and often leads to severe under- or over-prediction errors. To tackle this issue, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework. To strengthen the model's representation ability for density regression, we develop a new Density-Embedded Masked mOdeling (DEMO) method, which first takes the density map as an auxiliary modality to perform multimodal self-representation learning for the image and the density map. Although DEMO provides effective cross-modal regression guidance, it also brings in redundant background information, making it difficult to focus on the foreground regions. To handle this dilemma, we propose a spatial adaptive masking derived from density maps to boost efficiency. Meanwhile, we employ an optical flow-based temporal collaborative fusion strategy to capture the dynamic variations across frames, aligning features to derive multi-frame density residuals. The counting accuracy of the current frame is thus boosted by information from adjacent frames. In addition, considering that most existing datasets are limited to human-centric scenarios, we are the first to propose a large video bird counting dataset, DroneBird, in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and our DroneBird validate the superiority of our method over its counterparts. The code and dataset are available.
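As a rough illustration of the idea of deriving a spatial mask from a density map, the sketch below keeps every patch that contains any density mass and keeps background patches only with a small probability. This is a minimal sketch under stated assumptions, not the paper's spatial adaptive masking rule, which this summary does not specify. The function names, the patch size, the `eps` threshold, and the `bg_keep_prob` parameter are all hypothetical.

```python
import numpy as np


def patch_density(density, patch):
    """Sum a (H, W) density map over non-overlapping patch x patch cells.

    Rows and columns that do not fill a whole patch are cropped away.
    """
    h, w = density.shape
    gh, gw = h // patch, w // patch
    cropped = density[: gh * patch, : gw * patch]
    return cropped.reshape(gh, patch, gw, patch).sum(axis=(1, 3))


def adaptive_keep_mask(density, patch=16, bg_keep_prob=0.1, eps=1e-6, rng=None):
    """Return a boolean (H // patch, W // patch) grid of patches to keep.

    Assumption (hypothetical rule): a patch is foreground if its summed
    density exceeds eps. Foreground patches are always kept; background
    patches are kept independently with probability bg_keep_prob.
    """
    rng = np.random.default_rng() if rng is None else rng
    per_patch = patch_density(density, patch)
    foreground = per_patch > eps
    keep_background = rng.random(per_patch.shape) < bg_keep_prob
    return foreground | keep_background


# Usage: a 64x64 density map with a single annotated object.
density = np.zeros((64, 64))
density[5, 5] = 1.0
mask = adaptive_keep_mask(density, patch=16, bg_keep_prob=0.1,
                          rng=np.random.default_rng(0))
print(mask.shape)  # (4, 4)
```

With a sparse scene, most patches are background, so a low retention probability sharply reduces the number of tokens an encoder would have to process while every patch containing a target is still kept.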

Paper Structure

This paper contains 25 sections, 6 equations, 12 figures, 12 tables, 2 algorithms.

Figures (12)

  • Figure 1: The chord diagram illustrates the associations between the attributes of our proposed dataset. Each attribute is shown with a few example frames from the dataset for reference. We provide two zoomed-in examples for better visualization. The right part shows the experimental results of our proposed method and a previous video counting method on each attribute of our DroneBird dataset.
  • Figure 2: An overview of our E-MAC. For the temporal collaborative fusion, we use optical flow to fuse multi-frame density maps. For the density-embedded masked modeling, the image and density map are treated as multi-modal data and are fed into the transformer autoencoder for self-representation masked modeling simultaneously. The spatial adaptive masking uses the density map to balance the dynamic fore-background. During inference, the density map is fully masked.
  • Figure 3: Visualized comparisons on the FDST dataset and the Mall dataset.
  • Figure 4: Visualized comparisons on the VSCrowd dataset and our DroneBird dataset.
  • Figure 5: Hyperparameter analysis of background retention probability, mask ratio, and loss weights.
  • ...and 7 more figures
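
The Figure 2 caption mentions using optical flow to fuse multi-frame density maps. The sketch below shows one simple way such an alignment could work: backward-warp the previous frame's density map into the current frame with a given flow field, then blend it with the current estimate. This is a simplified illustration under stated assumptions, not the paper's temporal collaborative fusion, which aligns features to derive density residuals. The function names, the flow convention, the nearest-neighbour sampling, and the `weight` parameter are all hypothetical.

```python
import numpy as np


def warp_with_flow(prev_density, flow):
    """Backward-warp a (H, W) density map into the current frame.

    Assumed convention: flow[y, x] = (dx, dy) means current pixel (x, y)
    corresponds to (x + dx, y + dy) in the previous frame. Sampling is
    nearest-neighbour, and out-of-range coordinates are clipped to the
    border, so mass near the border may be duplicated.
    """
    h, w = prev_density.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_density[src_y, src_x]


def fuse(curr_density, prev_density, flow, weight=0.5):
    """Blend the current density map with the flow-aligned previous one."""
    warped = warp_with_flow(prev_density, flow)
    return (1.0 - weight) * curr_density + weight * warped


# Usage: one object that moved one pixel to the right between frames.
prev_density = np.zeros((6, 6))
prev_density[2, 3] = 1.0
curr_density = np.zeros((6, 6))
curr_density[2, 4] = 1.0
flow = np.zeros((6, 6, 2))
flow[..., 0] = -1.0  # each current pixel looks one pixel to the left
fused = fuse(curr_density, prev_density, flow)
print(fused[2, 4])
```

After alignment the two maps agree on the object's location, so the blend reinforces the current estimate instead of smearing the count across two positions.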