Embodied Crowd Counting

Runling Long, Yunlong Wang, Jia Wan, Xiang Deng, Xinting Zhu, Weili Guan, Antoni B. Chan, Liqiang Nie

TL;DR

This work introduces Embodied Crowd Counting (ECC), which addresses occlusion in crowd counting through drone-based, interactive sensing in large outdoor environments. It provides the Embodied Crowd Counting Dataset (ECCD) for large-scale, interactive crowd analysis and proposes ZECC, a zero-shot baseline built from three modules: Active Top-down Exploration (ATE), Normal-line based Navigation (NLBN), and Fine Detection and Counting (FDC). Together these aim at accurate counting with efficient exploration. ZECC achieves a favorable trade-off between counting error ($MAPE$) and navigation distance ($TD$) against competitive baselines. Extensive ablations confirm the necessity of each component, and real-world demonstrations verify robustness to occlusion. The work establishes a new benchmark and methodology for scalable, interactive crowd analysis, with potential applications in public safety and urban planning. It acknowledges the limitations of a simulation-based setup and leaves dynamic targets and real-world deployment to future work.

Abstract

Occlusion is one of the fundamental challenges in crowd counting. Various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which these methods are trained are captured with passive cameras, which restricts their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential for precise object detection in interactive scenes. These methods use active camera settings, which hold promise for addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, and their performance in analyzing complex object distributions in large-scale scenes, such as crowds, is unknown. In addition, most existing embodied navigation datasets consist of indoor scenes with limited scale and object quantity, which prevents their use in dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build an interactive simulator, the Embodied Crowd Counting Dataset (ECCD), which supports large-scale scenes and large object quantities. A prior probability distribution that approximates realistic crowd distribution is introduced to generate crowds. Then, a zero-shot navigation method (ZECC) is proposed. This method contains an MLLM-driven coarse-to-fine navigation mechanism, which enables active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental results against baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.
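
The trade-off between counting accuracy and navigation cost can be made concrete with a small sketch. This page names the two quantities only by symbol ($MAPE$ and $TD$) and does not give their formulas, so the code below rests on two assumptions: that $MAPE$ is the standard mean absolute percentage error over scenes, and that $TD$ is the summed Euclidean length of the drone's path between consecutive waypoints. The function names `mape` and `travel_distance` are hypothetical and are not taken from the paper.

```python
import math


def mape(true_counts, predicted_counts):
    """Mean absolute percentage error over scenes, in percent.

    Assumption: the standard MAPE definition; the paper's exact
    formula is not shown on this page.
    """
    if len(true_counts) != len(predicted_counts):
        raise ValueError("count lists must have the same length")
    if not true_counts:
        raise ValueError("at least one scene is required")
    if any(t <= 0 for t in true_counts):
        raise ValueError("true counts must be positive")
    errors = [abs(p - t) / t for t, p in zip(true_counts, predicted_counts)]
    return 100.0 * sum(errors) / len(errors)


def travel_distance(waypoints):
    """Total path length through a sequence of (x, y, z) waypoints.

    Assumption: TD is the sum of straight-line segment lengths
    between consecutive waypoints.
    """
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))


# Illustrative values only, not results from the paper.
scene_error = mape([100, 200], [90, 220])
path_length = travel_distance([(0, 0, 0), (3, 4, 0), (3, 4, 12)])
print(scene_error, path_length)
```

Under these assumptions, a method offers a better trade-off when it reaches a lower `mape` value for the same or a smaller `travel_distance`.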

Paper Structure

This paper contains 29 sections, 12 equations, 11 figures, 7 tables, 2 algorithms.

Figures (11)

  • Figure 1: (a) Comparison between ECCD and existing embodied navigation datasets. ECCD features large-scale outdoor crowd scenes. (b) Comparison between ECCD and crowd counting datasets. ECCD supports interaction. (c) Comparison between ZECC and existing crowd counting methods. ZECC is an agentic framework that adjusts the camera automatically.
  • Figure 2: (a) ECCD is designed to mimic building and crowd distribution realistically. On the left are samples from ECCD, and on the right are the real scenes. (b) Illustration of the potential navigation vectors, normal lines, and FBE view vectors. Zoom in for better visualization.
  • Figure 3: The proposed framework. First, ATE is proposed to estimate the global crowd distribution efficiently. Then, NLBN is proposed to generate fine observation points, alleviating crowd overlap. The final result is generated by aggregating all fine detections.
  • Figure 4: Performance and cost of ZECC and the baselines under different crowd density levels. L1-L5 refer to increasing density levels. The figure demonstrates that ZECC achieves a balance between performance and exploration cost.
  • Figure 5: (a) Comparison of performance-cost trade-off. ZECC achieves a better trade-off when TD is limited. (b) The effect of four hyperparameters in ZECC. It shows that ZECC is effective when the hyperparameters are set within reasonable ranges.
  • ...and 6 more figures