MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath

TL;DR

MomaGraph presents a unified, state-aware scene-graph representation that fuses spatial and functional relations with part-level interactive nodes to support embodied task planning. It introduces MomaGraph-Scenes for large-scale, task-driven annotations, MomaGraph-Bench for structured evaluation, and MomaGraph-R1, a 7B vision-language model trained with reinforcement learning to generate task-oriented graphs and plan in a Graph-then-Plan framework. The model achieves state-of-the-art results among open-source models (71.6% accuracy on the benchmark, +11.4% over the best baseline), generalizes across public benchmarks, and transfers to real-robot experiments. By unifying relations and dynamically updating graphs as actions unfold, MomaGraph addresses limitations of single-relational, static scene graphs and enables more reliable, interpretable embodied reasoning and planning.

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
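To make the representation described above concrete, here is a minimal Python sketch of a task-specific scene graph that keeps spatial and functional relations in one structure, attaches a state to each node, and models actionable parts as nodes linked to their parent object. The paper's actual schema is not given in this section, so every class, field, and relation name here (Node, Edge, SceneGraph, part_of, "controls", and so on) is a hypothetical choice for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """An object or an actionable part (hypothetical schema)."""
    name: str
    state: Optional[str] = None    # e.g. "off"; None if no state is tracked
    part_of: Optional[str] = None  # parent object name for part-level nodes


@dataclass
class Edge:
    """A directed relation; kind is either 'spatial' or 'functional'."""
    source: str
    target: str
    relation: str
    kind: str


@dataclass
class SceneGraph:
    """One graph per task, holding both relation types together."""
    task: str
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, name: str, state: Optional[str] = None,
                 part_of: Optional[str] = None) -> None:
        self.nodes[name] = Node(name, state, part_of)

    def add_edge(self, source: str, target: str, relation: str, kind: str) -> None:
        if kind not in ("spatial", "functional"):
            raise ValueError(f"unknown relation kind: {kind}")
        for endpoint in (source, target):
            if endpoint not in self.nodes:
                raise KeyError(endpoint)
        self.edges.append(Edge(source, target, relation, kind))

    def relations(self, kind: str) -> List[Tuple[str, str, str]]:
        """Return (source, relation, target) triples of one kind."""
        return [(e.source, e.relation, e.target) for e in self.edges if e.kind == kind]

    def parts_of(self, obj: str) -> List[str]:
        """Return the names of part-level nodes attached to an object."""
        return [n.name for n in self.nodes.values() if n.part_of == obj]


# Illustrative scene (invented for this sketch, not taken from the dataset).
graph = SceneGraph(task="boil water")
graph.add_node("stove", state="off")
graph.add_node("stove_knob", state="off", part_of="stove")
graph.add_node("kettle", state="empty")
graph.add_edge("kettle", "stove", "on_top_of", kind="spatial")
graph.add_edge("stove_knob", "stove", "controls", kind="functional")

print(graph.relations("functional"))
print(graph.parts_of("stove"))
```

Keeping both relation kinds in a single edge list lets a planner ask where the kettle is and what controls the stove against the same structure, which is the point of the unified design.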

Paper Structure

This paper contains 36 sections, 4 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Overview of MomaGraph. Given a task instruction, MomaGraph constructs a task-specific scene graph that highlights relevant objects and parts along with their spatial-functional relationships, enabling the robot to perform spatial understanding and task planning.
  • Figure 2: Direct planning often fails even for strong closed-source models like GPT-5, producing wrong actions or missing key steps, while our Graph-then-Plan approach with structured scene graphs enables accurate and complete task sequences aligned with ground truth.
  • Figure 3: MomaGraph captures state changes in the environment and dynamically updates the task-specific scene graph accordingly, enabling the graph to evolve as interactions occur and reflecting updated spatial-functional relationships.
  • Figure 4: Examples of multiple-choice VQA tasks in MomaGraph-Bench. We showcase example questions covering six core reasoning capabilities. Beyond these core capabilities, we further design tasks on Dynamic Verification and Long-horizon Task Decomposition to evaluate temporal reasoning and multi-step planning.
  • Figure 5: Real-robot experiments on the RobotEra Q5 with a D455, demonstrating four household tasks whose execution requires reasoning over spatial, functional, and part-level interactive elements.
  • ...and 14 more figures
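
The Graph-then-Plan flow (Figure 2) and the state-driven graph updates (Figure 3) can be sketched as a two-stage pipeline: first predict a task-specific graph, then plan from that graph, and refresh the graph whenever an action changes an object's state. In the paper both stages are handled by the vision-language model; the sketch below substitutes hand-written stub functions so that it runs on its own. All function names, the dictionary-based graph layout, and the cabinet scenario are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical graph layout: node name -> attribute dict with a "state" entry.
Graph = Dict[str, Dict[str, str]]


def graph_then_plan(
    instruction: str,
    observation: str,
    predict_graph: Callable[[str, str], Graph],
    plan_from_graph: Callable[[str, Graph], List[str]],
) -> Tuple[Graph, List[str]]:
    """Stage 1 builds the task graph; stage 2 plans conditioned on it."""
    graph = predict_graph(instruction, observation)
    steps = plan_from_graph(instruction, graph)
    return graph, steps


def apply_state_change(graph: Graph, node: str, new_state: str) -> Graph:
    """Return a copy of the graph with one node's state updated."""
    updated = {name: dict(attrs) for name, attrs in graph.items()}
    updated[node]["state"] = new_state
    return updated


def stub_predict_graph(instruction: str, observation: str) -> Graph:
    # Stand-in for the model's graph prediction (fixed output, no perception).
    return {
        "cabinet": {"state": "closed"},
        "cabinet_handle": {"state": "idle", "part_of": "cabinet"},
        "cup": {"state": "inside_cabinet"},
    }


def stub_plan(instruction: str, graph: Graph) -> List[str]:
    # Stand-in for the model's planner: a single hand-written rule.
    steps = []
    if graph["cabinet"]["state"] == "closed":
        steps.append("pull cabinet_handle to open cabinet")
    steps.append("pick up cup")
    return steps


graph, steps = graph_then_plan("get the cup", "image-placeholder",
                               stub_predict_graph, stub_plan)
print(steps)

# After the robot opens the cabinet, update the graph and plan again.
graph_after = apply_state_change(graph, "cabinet", "open")
print(stub_plan("get the cup", graph_after))
```

Returning a copy from apply_state_change keeps the earlier graph intact, so the states before and after an action can be compared directly.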