Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning

Yi Wang, Zeyu Xue, Mujie Liu, Tongqin Zhang, Yan Hu, Zhou Zhao, Chenguang Yang, Zhenyu Lu

TL;DR

This work addresses teleoperation under non-negligible network latency by introducing ST-OVSG, an open-vocabulary spatio-temporal scene graph with per-frame latency tags that align the operator's view with remote observations. It combines open-vocabulary object and relationship reasoning with temporal linking via Hungarian matching, and uses a lightweight latency tag to ground commands in the correct historical scene state. The approach enables task-focused subgraph extraction for LVLM planners, achieving 74% node accuracy on static Replica data and 70.5% planning success in the latency-robustness experiment, while generalizing to novel categories without fine-tuning. Overall, ST-OVSG improves robustness to transmission delays and supports open-vocabulary planning in dynamic teleoperation contexts, with potential for end-to-end integration and broader deployment.

Abstract

Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remotely perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74 percent node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5 percent.
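
To make the temporal linking step concrete, the following is a minimal sketch of one-to-one, minimum-cost matching between object nodes of two consecutive frames. It is not the authors' implementation: the paper's temporal matching cost is not given here, so `match_cost` (centroid distance plus a label-mismatch penalty), the `max_cost` gate, and the node fields `label` and `centroid` are all assumptions made for illustration. The optimum is found by exhaustive search, which gives the same assignment the Hungarian algorithm would for these tiny inputs but does not scale to large scenes.

```python
from itertools import permutations
from math import dist, inf


def match_cost(prev, curr, label_penalty=1.0):
    """Hypothetical cost: centroid distance plus a penalty if labels differ."""
    spatial = dist(prev["centroid"], curr["centroid"])
    return spatial + (0.0 if prev["label"] == curr["label"] else label_penalty)


def link_frames(prev_nodes, curr_nodes, max_cost=0.5):
    """Return (prev_idx, curr_idx) pairs with minimum total cost.

    Exhaustive search over one-to-one assignments (a stand-in for the
    Hungarian algorithm). Pairs costlier than max_cost are dropped, so
    objects that appear or disappear stay unmatched.
    """
    cost = [[match_cost(p, c) for c in curr_nodes] for p in prev_nodes]
    n_prev, n_curr = len(prev_nodes), len(curr_nodes)
    swap = n_prev > n_curr  # always enumerate over the smaller side
    small, large = (n_curr, n_prev) if swap else (n_prev, n_curr)

    best_total, best_pairs = inf, []
    for perm in permutations(range(large), small):
        pairs = [(perm[i], i) if swap else (i, perm[i]) for i in range(small)]
        total = sum(cost[p][c] for p, c in pairs)
        if total < best_total:
            best_total, best_pairs = total, pairs
    return [(p, c) for p, c in best_pairs if cost[p][c] <= max_cost]


# Toy data: the cup and book move slightly; a pen appears in the new frame.
prev = [
    {"label": "cup", "centroid": (0.0, 0.0, 0.0)},
    {"label": "book", "centroid": (1.0, 0.0, 0.0)},
]
curr = [
    {"label": "book", "centroid": (1.0, 0.1, 0.0)},
    {"label": "cup", "centroid": (0.1, 0.0, 0.0)},
    {"label": "pen", "centroid": (3.0, 3.0, 0.0)},
]
links = link_frames(prev, curr)
print(links)
```

Each returned pair would become a temporal edge between the same object in adjacent frames; the unmatched pen would start a new track.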

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: System overview. Based on the scene feedback at time $t_{n-1}+\Delta t$, the local operator issues natural-language commands. These commands are sent over the data network to the remote side, where ST-OVSG temporally aligns the local commands with the remote observations to compensate for link latency. This alignment stabilizes the large model’s semantic reasoning and drives reliable execution by the robotic arm.
  • Figure 2: ST-OVSG builds a spatio-temporal open-vocabulary scene graph from RGB-D video sequences. Objects are detected and segmented from RGB frames, then fused with depth to form semantic nodes. The resulting per-frame graphs are linked across frames using the Hungarian algorithm, producing a 4D scene graph with spatial and temporal edges and latency tags. User commands are used to query node features, filtering relevant nodes to form an ST-OVSG subgraph, which is then serialized into JSON and provided to the LVLM planner for generating executable robot task plans.
  • Figure 3: Execution process of the proposed method in a task. Left: the user provides a natural-language grasp-and-place instruction at the local side (issued at 5.5 s, with a communication latency of 500 ms). ST-OVSG builds a time-aware, open-vocabulary scene graph, and based on this, the LVLM generates a latency-aware grasp-and-place plan at the remote side. Right: the robot executes the plan in sequence: it approaches the target, performs a stable grasp, transports the object smoothly, and places it safely at the designated location.
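
The latency-aware lookup and subgraph serialization described in Figures 2 and 3 can be sketched as follows, using the numbers from Figure 3 (command issued at 5.5 s, 500 ms latency). This is an illustrative sketch under stated assumptions, not the paper's method: it assumes the 500 ms is the one-way delay of the feedback the operator was looking at, that each frame carries a capture timestamp `t`, and it replaces the paper's node-feature query with plain keyword matching on labels. The field names `t`, `nodes`, `edges`, `src`, `dst`, and `rel` are hypothetical.

```python
import json
from bisect import bisect_right


def frame_seen_by_operator(frame_times, issue_time, latency):
    """Index of the latest frame captured at or before issue_time - latency.

    Assumes frame_times is sorted and latency is the one-way feedback delay.
    """
    idx = bisect_right(frame_times, issue_time - latency) - 1
    if idx < 0:
        raise ValueError("no frame is old enough for this command")
    return idx


def task_subgraph(frame, command):
    """Keep nodes whose label appears in the command, plus edges among them.

    Keyword matching is a simple stand-in for querying node features.
    """
    words = set(command.lower().split())
    nodes = [n for n in frame["nodes"] if n["label"] in words]
    keep = {n["id"] for n in nodes}
    edges = [e for e in frame["edges"] if e["src"] in keep and e["dst"] in keep]
    return {"t": frame["t"], "nodes": nodes, "edges": edges}


# Toy scene history: the cup moves from the table to the shelf at 5.5 s.
objects = [
    {"id": 0, "label": "cup"},
    {"id": 1, "label": "table"},
    {"id": 2, "label": "shelf"},
]
frames = [
    {"t": 4.5, "nodes": objects, "edges": [{"src": 0, "dst": 1, "rel": "on"}]},
    {"t": 5.0, "nodes": objects, "edges": [{"src": 0, "dst": 1, "rel": "on"}]},
    {"t": 5.5, "nodes": objects, "edges": [{"src": 0, "dst": 2, "rel": "on"}]},
]

command = "pick up the cup on the table"
idx = frame_seen_by_operator([f["t"] for f in frames], issue_time=5.5, latency=0.5)
sub = task_subgraph(frames[idx], command)
payload = json.dumps(sub)  # compact JSON that would be passed to the planner
print(payload)
```

In this toy history the command is grounded in the frame captured at 5.0 s, where the cup is still on the table, rather than in the newest frame, where it has already moved to the shelf.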