Scene-agnostic Hierarchical Bimanual Task Planning via Visual Affordance Reasoning

Kwang Bin Lee, Jiho Kang, Sung-Hee Lee

TL;DR

The paper tackles the challenge of translating open-world, high-level instructions into coordinated two-handed manipulation. It introduces a three-module pipeline: Visual Point Grounding (VPG) for scene grounding, Bimanual Subgoal Planner (BSP) for subgoal structure and merging, and Interaction-Point–Driven Bimanual Prompting (IPBP) for instantiating synchronized two-handed actions. The pipeline is augmented by Retrieval-Augmented Skill Generation (Skill RAG) for skill selection. By grounding object and interaction points in 3D and reasoning about reachability and affordances, the method produces semantically meaningful, physically feasible, and parallelizable bimanual plans that generalize to unseen cluttered scenes without retraining. Ablation studies demonstrate each module’s critical role, and Unity-based experiments report high success rates and compact plan lengths, indicating robust scene-agnostic affordance reasoning for bimanual tasks.

Abstract

Embodied agents operating in open environments must translate high-level instructions into grounded, executable behaviors, often requiring coordinated use of both hands. While recent foundation models offer strong semantic reasoning, existing robotic task planners remain predominantly unimanual and fail to address the spatial, geometric, and coordination challenges inherent to bimanual manipulation in scene-agnostic settings. We present a unified framework for scene-agnostic bimanual task planning that bridges high-level reasoning with 3D-grounded two-handed execution. Our approach integrates three key modules. Visual Point Grounding (VPG) analyzes a single scene image to detect relevant objects and generate world-aligned interaction points. Bimanual Subgoal Planner (BSP) reasons over spatial adjacency and cross-object accessibility to produce compact, motion-neutralized subgoals that exploit opportunities for coordinated two-handed actions. Interaction-Point-Driven Bimanual Prompting (IPBP) binds these subgoals to a structured skill library, instantiating synchronized unimanual or bimanual action sequences that satisfy hand-state and affordance constraints. Together, these modules enable agents to plan semantically meaningful, physically feasible, and parallelizable two-handed behaviors in cluttered, previously unseen scenes. Experiments show that the framework produces coherent, feasible, and compact two-handed plans and generalizes to cluttered scenes without retraining, demonstrating robust scene-agnostic affordance reasoning for bimanual tasks.

Paper Structure

This paper contains 30 sections, 13 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the Visual Point Grounding (VPG) system. A. Object Point Generation: The system analyzes a global RGB overview image using an image–text prompt to identify instruction-relevant objects. Grounded-SAM marks each object in 2D, and these marks are lifted into 3D to create object points with associated labels. B. Interaction Point Generation: For each object point, a close-up local RGB view is processed with targeted queries (such as “handle,” “knob,” or “graspable”) to locate manipulable parts. Grounded-SAM and a Set-of-Marks prompt extract these part-level regions and project them into 3D as interaction points with descriptive attributes. C. Scene Augmentation and Clustering: The original scene is then augmented with all object and interaction points, and an adjacency graph is constructed by clustering nearby object points to capture local spatial relationships for downstream planning.
  • Figure 2: Overview of the Point-Driven Bimanual Planning system, composed of the Bimanual Subgoal Planner (BSP) and the Interaction-Point–Driven Bimanual Prompting module (IPBP). A. Bimanual Subgoal Planner: Using the user's task command together with the adjacency graph produced by VPG, BSP selects task-relevant object regions and generates a sequence of bimanual subgoals, refined through skill-name matching and retrieval from the skill knowledge base. B. Object-Point Navigation: The agent walks through the pre-processed scene produced by VPG, navigates to each subgoal’s object point using A* pathfinding, and samples the nearest interaction points visible from that location. C. Interaction-Point–Driven Bimanual Prompting: IPBP combines the refined skill, the current subgoal, sampled interaction points, and the agent’s hand state to produce synchronized bimanual action tuples guided by retrieved coordination patterns. D. Scene Update: Executed action tuples update the hand state, object interactions, and remaining subgoals, enabling iterative execution of the full bimanual manipulation sequence.
  • Figure 3: Workflow of point-driven bimanual task planning. Given a high-level user command (heating a lunch box) and a VPG-processed scene (convenience store), the Bimanual Subgoal Planner (BSP) forms object-point–level subgoals and assigns the most suitable canonical manipulation skills. Each subgoal is then converted into grounded bimanual tuples by the Interaction-Point–Driven Bimanual Prompting module (IPBP), which binds manipulation templates to the retrieved interaction points and current hand states. The resulting action sequence is executed in Unity, where the agent navigates to each object point and performs the required two-handed interactions, producing coherent, feasible, and visually grounded bimanual behavior.
  • Figure 4: Overview of bimanual action tuples generated by the Interaction-Point–Driven Bimanual Prompting system across diverse scene–task pairs. Each example shows a synchronized left–right action grounded in local interaction-point identifiers, demonstrating consistent, affordance-aligned bimanual behavior despite variations in scene layout and object configuration.
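
As a rough illustration of the adjacency-graph step described in the Figure 1 caption, the sketch below links object points that lie within a fixed distance of one another and then reads clusters off as connected components. The distance threshold `radius`, the function names, and the sample coordinates are assumptions made for this example; the caption does not specify which clustering method the paper uses.

```python
from itertools import combinations
from math import dist


def build_adjacency_graph(object_points, radius=1.0):
    """Link every pair of object points whose 3D distance is at most `radius`.

    `object_points` maps a label to an (x, y, z) position. The threshold rule
    is an assumed stand-in for the paper's clustering procedure.
    """
    graph = {label: set() for label in object_points}
    for (a, pa), (b, pb) in combinations(object_points.items(), 2):
        if dist(pa, pb) <= radius:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Return the connected components of the graph via depth-first search."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


# Hypothetical object points (label -> world-space x, y, z in metres).
object_points = {
    "microwave": (0.0, 1.0, 0.0),
    "lunch_box": (0.4, 1.0, 0.2),
    "fridge": (3.0, 1.0, 0.0),
}
graph = build_adjacency_graph(object_points, radius=1.0)
groups = clusters(graph)
```

With these sample points, the microwave and lunch box fall into one cluster and the fridge into another. A planner could use such clusters to find objects that are close enough to be handled from one standing position.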