Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong

TL;DR

Holi-Spatial is the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention using the proposed data curation pipeline. In data curation quality, the pipeline outperforms existing feed-forward and per-scene optimized methods on ScanNet, ScanNet++, and DL3DV.

Abstract

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.

Paper Structure (19 sections, 2 equations, 17 figures, 5 tables, 1 algorithm)

Figures (17)

  • Figure 1: We introduce Holi-Spatial, the first fully automated pipeline capable of converting raw video streams into holistic 3D spatial annotations without human intervention. Compared to state-of-the-art methods, Holi-Spatial achieves a significant leap in annotation quality, improving multi-view depth estimation by 0.5 F1 and boosting 3D detection AP$_{50}$ by a remarkable 64% on ScanNet [dai2017scannet]. Based on this, we introduce Holi-Spatial-4M, a large-scale dataset that effectively empowers Vision-Language Models. As shown, fine-tuning Qwen3-VL on Holi-Spatial-4M leads to state-of-the-art performance, with a 15% AP$_{50}$ gain on ScanNet++ [yeshwanth2023scannet++] and a 7.9% accuracy rise on MMSI-Bench [yang2025mmsi]. Importantly, because the entire annotation pipeline is automatic, it can be further scaled up as resources permit.
  • Figure 2: Comparison of our refined annotations with the official annotations on the ScanNet dataset [dai2017scannet]. Our method achieves more accurate and sharper segmentation masks, as well as improved category recognition.
  • Figure 3: Overview of the Holi-Spatial data curation pipeline. The framework operates in three progressive stages: (1) Geometric Optimization distills high-fidelity 3D structure from video streams using 3DGS; (2) Image-level Perception lifts 2D VLM and SAM3 predictions into initial 3D proposals; and (3) Scene-level Refinement employs a coarse-to-fine strategy to merge, verify, and caption instances, yielding dense, high-quality spatial annotations. Finally, leveraging the generated Holi-Spatial-4M dataset, we directly fine-tune the Qwen-VL family for downstream tasks (e.g., 3D grounding and spatial reasoning).
  • Figure 4: Pipeline of 2D-to-3D OBB Generation. We transform 2D object masks into initial 3D OBBs via depth projection, utilizing a four-step strategy to mitigate the impact of depth floaters. (1) We obtain an initial object depth map by combining 3DGS rendering with SAM3 instance segmentation. (2) To mitigate 2D boundary errors from SAM3, we erode the object mask near its contour and keep only the reliable interior region. (3) To remove 3D outliers caused by depth discontinuities, we use a multi-view-consistent mesh depth as guidance and filter inconsistent pixels in the 3DGS depth. (4) Finally, we estimate the initial 3D OBB from the refined point cloud, while preserving the associated 2D mask, confidence score, and source image index.
  • Figure 5: Floor-aligned OBB post-processing pipeline. Starting from the input instance OBBs produced after 2D-to-3D lifting and initial OBB estimation, we (1) detect a floor (or fallback planar structure) to infer a global up-axis, (2) re-align each instance OBB under a yaw-lock constraint with optional PCA fallback and update extents, and (3) apply a validation check to output gravity-/floor-consistent OBBs for downstream scene-level refinement.
  • ...and 12 more figures
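
The lifting steps described in the captions of Figures 4 and 5 can be illustrated with a small sketch. This is not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a hypothetical relative depth tolerance `max_rel_diff`, and, for the box fit, points already expressed in a gravity-aligned frame with +z up (the camera-to-world transform and the floor detection are omitted). All function names are illustrative only.

```python
import math


def erode_mask(mask, iterations=1):
    """4-neighbour binary erosion on a list-of-lists boolean mask.

    Border pixels are dropped, so only the interior region survives.
    """
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        eroded = [[False] * w for _ in range(h)]
        for v in range(1, h - 1):
            for u in range(1, w - 1):
                eroded[v][u] = bool(
                    mask[v][u] and mask[v - 1][u] and mask[v + 1][u]
                    and mask[v][u - 1] and mask[v][u + 1]
                )
        mask = eroded
    return mask


def lift_to_points(mask, gs_depth, mesh_depth, fx, fy, cx, cy,
                   max_rel_diff=0.05):
    """Back-project masked pixels to camera-frame 3D points.

    A pixel is kept only if its rendered depth agrees with the reference
    (mesh) depth within a relative tolerance; the tolerance is an assumption.
    """
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            if not inside:
                continue
            d, d_ref = gs_depth[v][u], mesh_depth[v][u]
            if d <= 0 or d_ref <= 0 or abs(d - d_ref) > max_rel_diff * d_ref:
                continue
            # Standard pinhole back-projection.
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def yaw_locked_obb(points):
    """Fit a box that may rotate only about +z (assumed up-axis).

    The yaw is the principal direction of the xy covariance (2D PCA).
    Returns (center, extents, yaw_in_radians).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    yaw = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(yaw), math.sin(yaw)
    # Coordinates of each point along the two rotated horizontal axes.
    a = [(p[0] - mx) * c + (p[1] - my) * s for p in points]
    b = [-(p[0] - mx) * s + (p[1] - my) * c for p in points]
    z = [p[2] for p in points]
    extents = (max(a) - min(a), max(b) - min(b), max(z) - min(z))
    mid_a, mid_b = (max(a) + min(a)) / 2, (max(b) + min(b)) / 2
    center = (mx + mid_a * c - mid_b * s,
              my + mid_a * s + mid_b * c,
              (max(z) + min(z)) / 2)
    return center, extents, yaw
```

In use, `erode_mask` would be applied to a per-instance mask, `lift_to_points` to the rendered and reference depth maps for that view, and `yaw_locked_obb` to the resulting points once they have been transformed into a floor-aligned frame. Restricting the rotation to yaw keeps the box upright, which is the effect the yaw-lock constraint in Figure 5 describes.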