SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

TL;DR

SNOW introduces a training-free, backbone-agnostic framework that unifies open-world semantic priors with precise 3D geometry and temporal dynamics to build a persistent 4D Scene Graph (4DSG). By clustering point clouds with HDBSCAN, prompting SAM2 segmentation, and encoding object regions with Spatio-Temporal Tokenized Patch Encoding (STEP), SNOW produces object-centered tokens that fuse semantic, geometric, and temporal information and are anchored by SLAM for stable 4D grounding. The 4DSG serves as a rich, queryable prior for vision-language models, enabling grounded reasoning over space and time without retraining. Across NuScenes-QA, RoboSpatial-Home, VLM4D, and LiDAR segmentation tasks, SNOW achieves state-of-the-art or competitive performance in zero-shot settings, demonstrating the practical impact of structured 4D priors for open-world embodied robotics.

Abstract

Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art performance in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
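
The proposal stage described in the abstract (cluster the point cloud, then use the clusters to guide segmentation) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the cluster labels have already been computed (for example by an HDBSCAN implementation, with -1 marking noise), and it makes two further assumptions that the text does not state: the representative point of a cluster is the member closest to the cluster centroid, and 3D points are given in the camera frame and projected with a simple pinhole model. The names `representative_points` and `project_to_pixel`, and the intrinsics `fx`, `fy`, `cx`, `cy`, are hypothetical.

```python
from collections import defaultdict
from math import dist


def representative_points(points, labels):
    """Return one 3D point per cluster: the member closest to the centroid.

    Assumption: labels follow the HDBSCAN convention where -1 marks noise.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:  # skip noise points
            clusters[label].append(point)

    reps = {}
    for label, members in clusters.items():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        reps[label] = min(members, key=lambda p: dist(p, centroid))
    return reps


def project_to_pixel(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (assumed camera model)."""
    x, y, z = point
    if z <= 0:  # behind the camera, cannot serve as an image prompt
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Toy point cloud: two clusters plus one noise point.
points = [
    (0.0, 0.0, 5.0), (0.2, 0.0, 5.0), (0.1, 0.1, 5.0),
    (3.0, 1.0, 10.0), (3.2, 1.0, 10.0), (3.1, 1.0, 10.0),
    (9.0, 9.0, -1.0),
]
labels = [0, 0, 0, 1, 1, 1, -1]

reps = representative_points(points, labels)
# Pixel coordinates that could be handed to a promptable segmenter as point prompts.
prompts = {
    label: project_to_pixel(p, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    for label, p in reps.items()
}
```

In a full pipeline the resulting pixel coordinates would be passed to the segmentation model as point prompts; that call is omitted here because its exact interface is not given in the text.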

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of SNOW. SNOW builds a unified 4D Scene Graph (4DSG) by merging VLM semantics with 3D geometry and temporal continuity. STEP tokens encode object-level semantic, spatial, and temporal attributes into a persistent representation that enables grounded reasoning across diverse 4D benchmarks without additional training.
  • Figure 2: High-level pipeline of SNOW. The method clusters point clouds, samples representative points, and employs them as point prompts for SAM2-based segmentation. The resulting STEP tokens form a unified spatio-temporal scene graph (i.e., 4DSG), which serves as a persistent 4D world model, queryable by VLMs.
  • Figure 3: STEP token assignment process. Masks with at least 50% IoU containment retain their image tokens, which are enriched with 3D centroid, Gaussian shape, and extent tokens, as well as two temporal tokens marking appearance and disappearance. The resulting STEP tokens are assembled into a 4DSG, serving as SNOW's persistent 4D prior.
  • Figure 4: Qualitative examples of SNOW on RoboSpatial-Home and open-vocabulary LiDAR segmentation. For RoboSpatial-Home, red denotes the model prediction; blue denotes the ground truth reference.
  • Figure 5: Qualitative examples of SNOW on RoboSpatial-Home illustrating correct predictions (top row), ambiguous cases (middle row), and failure modes (bottom row). Red denotes the model prediction; blue denotes the ground truth reference.
  • ...and 3 more figures
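
The token layout described in the Figure 3 caption (image tokens enriched with centroid, Gaussian shape, extent, and two temporal tokens) can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the paper's data structure: the Gaussian shape is reduced to per-axis standard deviations, the extent is an axis-aligned size, the disappearance token is modelled as the last timestep at which the object was observed, object identities are assumed to be given by an upstream association step, and points are assumed to be already expressed in the SLAM world frame. The names `StepToken`, `SceneGraph4D`, `geometry_tokens`, and `integrate` are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class StepToken:
    object_id: int
    image_tokens: list   # placeholder for patch features of the masked region
    centroid: tuple      # 3D centroid of the object's points
    shape_sigma: tuple   # per-axis std. dev. (assumed diagonal Gaussian)
    extent: tuple        # axis-aligned size along x, y, z
    t_appear: int        # first timestep the object was observed
    t_disappear: int     # last timestep the object was observed (assumption)


def geometry_tokens(points):
    """Compute centroid, per-axis spread, and extent from world-frame points."""
    axes = list(zip(*points))
    centroid = tuple(mean(a) for a in axes)
    sigma = tuple(pstdev(a) for a in axes)
    extent = tuple(max(a) - min(a) for a in axes)
    return centroid, sigma, extent


@dataclass
class SceneGraph4D:
    nodes: dict = field(default_factory=dict)

    def integrate(self, object_id, image_tokens, points, t):
        """Insert or update a node, keeping its original appearance time."""
        centroid, sigma, extent = geometry_tokens(points)
        previous = self.nodes.get(object_id)
        t_appear = previous.t_appear if previous else t
        token = StepToken(object_id, image_tokens, centroid, sigma, extent, t_appear, t)
        self.nodes[object_id] = token
        return token


# The same object observed at two timesteps while it moves along x.
graph = SceneGraph4D()
graph.integrate(7, ["patch_a"], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], t=0)
token = graph.integrate(7, ["patch_b"], [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)], t=3)
```

Keeping the appearance time fixed while updating geometry on each observation is one simple way to make the graph persistent over time; how relations between nodes are stored and how the graph is serialized for a VLM are left out because the captions do not specify them.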