IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea

Abstract

Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$ fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $\kappa$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$ px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
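As a hedged illustration of the agreement metrics named above (a minimal sketch, not the authors' evaluation code; the function names and the per-instance binarisation are assumptions), mask IoU and Cohen's $\kappa$ can be computed over binary masks and matched-detection decisions as follows:

    import numpy as np

    def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
        """Intersection-over-union between two binary instance masks."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return float(inter) / union if union > 0 else 0.0

    def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
        """Cohen's kappa between two binary decision vectors, e.g.
        per-instance 'labelled as person' decisions from an
        auto-annotator versus a human annotator."""
        a, b = np.asarray(a, bool), np.asarray(b, bool)
        po = np.mean(a == b)  # observed agreement
        # Chance agreement: both say yes, plus both say no.
        pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
        return float((po - pe) / (1 - pe)) if pe < 1 else 1.0

For the tracking subset, note that MOTChallenge ground truth (in the MOT16/17 convention) stores one box per line as frame,id,bb_left,bb_top,bb_width,bb_height,conf,class,visibility.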

Figures (3)

  • Figure 1: Representative frames from the four dataset scenes (five views, left to right): ACS-EC ground-level view showing dense seating and circulation areas; ACS-EC elevated view providing a top-down perspective of the same atrium; ACS-EG narrow corridor with strong near-to-distal scale variance; IE-Central entrance hall captured from an elevated angle; and R-Central central atrium with prominent structural columns and an overhead viewpoint.
  • Figure 2: Spatial density heatmaps showing the normalised distribution of person bounding-box centres across all annotated frames per scene. Colour encodes relative density (yellow $\to$ dark red = low $\to$ high). The heatmaps reveal four distinct spatial regimes: a dominant horizontal band in ACS-EC driven by circulation traffic and stationary occupants in the common area; a concentrated mid-depth cluster in ACS-EG reflecting strong scale variance along its linear corridor axis; three discrete zones in IE-Central spanning the entry, corridor junction, and seating area; and a diffuse, column-interrupted spread in R-Central, where the overhead viewpoint collapses ascending and descending pedestrian flow into a single projection. These spatial patterns directly inform the per-scene variation in crowd density, occlusion rate, and detection difficulty reported in Sections \ref{subsec:occlusion} and \ref{subsec:autolabel_quality}. (A sketch of the centre-binning computation behind such heatmaps follows this list.)
  • Figure 3: Qualitative comparison of auto-labelling methods across ACS-EC, IE-Central, and R-Central (rows, top to bottom). Columns show the raw image, SAM3, GroundingSAM, and human ground truth, with per-frame instance counts (n). SAM3 produces false positives on ACS-EC (row 1); GroundingSAM misses occluded persons across all scenes.
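As a hedged illustration of how a heatmap like Figure 2 can be produced (a minimal sketch under assumed inputs, not the authors' pipeline; density_heatmap and the (x, y, w, h) box convention are assumptions):

    import numpy as np

    def density_heatmap(boxes, img_w, img_h, bins=64):
        """2D histogram of person bounding-box centres, scaled to [0, 1].

        boxes: iterable of (x, y, w, h) pixel boxes pooled over all
        annotated frames of one scene.
        """
        boxes = np.asarray(boxes, dtype=float)
        cx = (boxes[:, 0] + boxes[:, 2] / 2) / img_w  # normalised centre x
        cy = (boxes[:, 1] + boxes[:, 3] / 2) / img_h  # normalised centre y
        # Rows index y, columns index x, matching an image-like layout.
        hist, _, _ = np.histogram2d(cy, cx, bins=bins, range=[[0, 1], [0, 1]])
        peak = hist.max()
        return hist / peak if peak > 0 else hist

Rendering the result with matplotlib's YlOrRd colormap (e.g. plt.imshow(heatmap, cmap="YlOrRd")) reproduces the yellow-to-dark-red encoding described in the caption.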