
RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics

Yuzhi Huang, Jie Wu, Weijue Bu, Ziyi Xiong, Gaoyang Jiang, Ye Li, Kangye Ji, Shuzhao Xie, Yue Huang, Chenglei Wu, Jingyan Jiang, Zhi Wang

Abstract

Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to re-infer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and action consequences across interactions rather than reconstructing them at each instant. Inspired by this human capacity for causal spatio-temporal reasoning with persistent memory, we propose RoboStream, a training-free framework that achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding, and maintains causal continuity via a Causal Spatio-Temporal Graph (CSTG) that records action-triggered state transitions across steps. This design enables the planner to trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves 90.5% on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where both SoFar and VoxPoser score 11.1%, demonstrating that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.

Paper Structure

This paper contains 37 sections, 6 equations, 18 figures, and 10 tables.

Figures (18)

  • Figure 1: Qualitative comparison between SoFar (Qi et al.) and RoboStream. This long-horizon task involves object occlusion and state restoration. Top (SoFar): Diffuse attention heatmaps and absent memory logs reveal two cascading failures: Ungrounded Spatial Perception misplaces the blue cube at Step 2, while Untracked Causal History leaves the planner unaware of the occluded cube at Step 4, resulting in an incorrect final state. Bottom (RoboStream): Concentrated attention heatmaps and persistent memory logs show that Spatio-Temporal Grounding enables accurate object placement at Step 2, while Causal Memory Tracking maintains awareness of the occluded cube at Step 4, enabling successful task completion.
  • Figure 2: Overview of the RoboStream Framework. The system processes multi-modal RGB-D inputs through three hierarchical stages: (1) Object-Centric Perception and Spatio-Temporal Token Fusion, where raw visual data is distilled into STF-Tokens by grounding visual evidence within 3D geometric primitives (centroids and Gaussian shapes); (2) 4D Causal Spatio-Temporal Graph Construction, which maintains persistent memory encoding object identities and state transitions across time steps; and (3) VLM-based Reasoning and Planning, where the VLM leverages both the CSTG and visual inputs to perform high-level task planning and precise robotic manipulation.
  • Figure 3: Qualitative results in real-world and simulated manipulation tasks. We evaluate on three challenging tasks: (a) Block Building, demanding precise bottom-to-top assembly toward a target configuration; (b) Block Disassembly, requiring structured top-to-bottom deconstruction and rearrangement; and (c) Block Hide and Restore, requiring causal memory maintenance under full occlusion. RoboStream consistently outperforms SoFar (Qi et al.) and VoxPoser (Huang et al., 2023), particularly in spatio-temporal perception and causal memory over extended horizons.
  • Figure 4: Performance comparison on zero-shot real-world manipulation tasks. We design 21 diverse tasks covering block building (Task A), block disassembly (Task B), and block hide and restore (Task C).
  • Figure 5: Real-world experimental setup. (a) The Franka Research 3 robot arm used in our experiments. (b) The physical workspace equipped with the robot arm and an Intel RealSense D435i camera; the insets show the corresponding RGB view and depth map.
  • ...and 13 more figures
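
The persistent memory described in the Figure 1 and Figure 2 captions can be illustrated with a small sketch. This is not the authors' implementation: the class names (`ObjectToken`, `Transition`, `CausalGraph`), the axis-aligned Gaussian extent, and the rule "an undetected object keeps its last known geometry" are all assumptions made here for illustration. It only shows the general idea of binding object identities to 3D geometry and logging action-triggered state transitions so that an occluded object is not forgotten.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectToken:
    """Hypothetical stand-in for an STF-Token: an object id bound to 3D geometry."""
    name: str
    centroid: Vec3
    extent: Vec3          # assumed: per-axis std-dev of an axis-aligned Gaussian shape
    visible: bool = True


@dataclass
class Transition:
    """One action-triggered state change (an edge in the hypothetical graph)."""
    step: int
    action: str
    obj: str
    before: Vec3
    after: Vec3


@dataclass
class CausalGraph:
    """Hypothetical, heavily simplified stand-in for a causal spatio-temporal graph."""
    objects: Dict[str, ObjectToken] = field(default_factory=dict)
    history: List[Transition] = field(default_factory=list)

    def observe(self, detections: Dict[str, Tuple[Vec3, Vec3]]) -> None:
        # Refresh geometry for every detected object.
        for name, (centroid, extent) in detections.items():
            self.objects[name] = ObjectToken(name, centroid, extent, True)
        # Known objects that were not detected are kept with their last
        # geometry and flagged as not visible (assumed permanence rule).
        for name, token in self.objects.items():
            if name not in detections:
                token.visible = False

    def record_action(self, step: int, action: str, obj: str, new_centroid: Vec3) -> None:
        # Log the state transition, then update the object's stored position.
        token = self.objects[obj]
        self.history.append(Transition(step, action, obj, token.centroid, new_centroid))
        token.centroid = new_centroid

    def trace(self, obj: str) -> List[Transition]:
        # Return the chain of transitions that affected one object.
        return [t for t in self.history if t.obj == obj]


# Usage: a cup is placed over a blue cube, which then disappears from view.
graph = CausalGraph()
graph.observe({
    "blue_cube": ((0.4, 0.0, 0.02), (0.02, 0.02, 0.02)),
    "cup": ((0.2, 0.1, 0.05), (0.04, 0.04, 0.05)),
})
graph.record_action(1, "place cup over blue_cube", "cup", (0.4, 0.0, 0.05))
graph.observe({"cup": ((0.4, 0.0, 0.05), (0.04, 0.04, 0.05))})

print(graph.objects["blue_cube"])  # still present, with visible=False
print(graph.trace("cup"))          # the single recorded transition
```

In this toy version, the planner could query `graph.objects` to find the hidden cube's last known position and `graph.trace` to see which action caused the occlusion. A real system would also need data association, uncertainty handling, and relations between objects, none of which are modeled here.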