Spatially Grounded Long-Horizon Task Planning in the Wild

Sehun Jung, HyunJee Song, Dong-Hee Kim, Reuben Tan, Jianfeng Gao, Yong Jae Lee, Donghyun Kim

Abstract

Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.

Paper Structure

This paper contains 9 sections, 1 equation, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Motivation of grounded planning. (a) VLM-as-Planner decomposes high-level instructions into natural language sub-actions, which are then grounded by separate perception modules. However, the lack of explicit spatial specification often leads to ambiguous action grounding. (b) Our GroundedPlanBench jointly annotates and evaluates hierarchical sub-action planning and spatial grounding under both explicit and implicit instructions.
  • Figure 2: Task distribution and instruction types in GroundedPlanBench.
  • Figure 3: An example of a simplified instruction prompt for spatially grounded planning.
  • Figure 4: V2GP is a training data generation framework that derives spatially grounded sub-action plans from real-world robot demonstration videos in four stages: (1) Temporal sub-action decomposition, where gripper state signals segment demonstrations into sub-action units; (2) Interactive object identification, where a VLM analyzes each segment to identify the actively manipulated objects; (3) Spatial grounding of actions, where SAM3 localizes target objects and placement endpoints using bounding boxes and points; and (4) Spatially grounded task planning, which combines the grounded sub-action primitives with explicit and implicit task instructions. The collected data are used to fine-tune VLM-as-Planners, improving both hierarchical task planning (what to do) and spatial grounding (where to act) for long-horizon tasks.
  • Figure 5: Visualization of decoupled task planning and spatial grounding, where underlined actions are incorrectly grounded due to semantically similar objects (e.g., identical napkins). In contrast, V2GP finds the correct correspondence between each action and spatial grounding, enabling sequentially consistent and accurate execution.
  • ...and 2 more figures
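
Stage (1) in the Figure 4 caption says that gripper state signals segment demonstrations into sub-action units, but the caption does not give the segmentation rule. The following is a minimal sketch of one plausible reading, assuming a per-frame boolean signal (True when the gripper is closed) and cutting a new segment at every state change. The names `Segment` and `segment_by_gripper` and the labels "free" and "holding" are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One candidate sub-action unit over the half-open frame range [start, end)."""
    label: str  # hypothetical label: "holding" if gripper closed, else "free"
    start: int
    end: int


def segment_by_gripper(closed: List[bool]) -> List[Segment]:
    """Split a per-frame gripper signal into runs of constant state.

    Assumption: closed[i] is True when the gripper is closed at frame i.
    """
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(closed) + 1):
        # Close the current run at the end of the signal or on a state change.
        if i == len(closed) or closed[i] != closed[start]:
            label = "holding" if closed[start] else "free"
            segments.append(Segment(label, start, i))
            start = i
    return segments


if __name__ == "__main__":
    signal = [False, False, True, True, True, False]
    for seg in segment_by_gripper(signal):
        print(seg)
```

Under this assumption, each "holding" run would correspond to a candidate grasp-and-carry segment that the later stages (object identification and spatial grounding) could annotate. A real pipeline would likely need to debounce noisy gripper readings before cutting segments.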