Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions

Hengxuan Xu, Fengbo Lan, Zhixin Zhao, Shengjie Wang, Mengqiao Liu, Jieqian Sun, Yu Cheng, Tao Zhang

TL;DR

This work tackles the challenge of executing tasks from ambiguous human instructions in unfamiliar environments by introducing AIDE, a dual-stream framework that couples MSI-based planning with ADM-based real-time execution. Through a multimodal chain-of-thought module, an affordance-informed Instruction-Tool Relationship Space, an Efficient Retrieval Scheme, and a proactive Exploration Policy, AIDE achieves zero-shot task planning with high accuracy and real-time closed-loop performance. Key findings show superior task planning accuracy (>80%) and near 10 Hz closed-loop execution (>95% accuracy on valid frames) across simulation and real-world tests, along with robust exploration capabilities in tool-absent scenarios. The approach offers practical potential for open-world robotics by reducing hallucinations and enabling interactive environment/tool grounding for ambiguous instructions.

Abstract

Enabling robots to explore and act in unfamiliar environments under ambiguous human instructions by interactively identifying task-relevant objects (e.g., identifying cups or beverages for "I'm thirsty") remains challenging for existing vision-language model (VLM)-based methods. This challenge stems from inefficient reasoning and the lack of environmental interaction, which hinder real-time task planning and execution. To address this, we propose Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions (AIDE), a dual-stream framework that integrates interactive exploration with vision-language reasoning. Multi-Stage Inference (MSI) serves as the decision-making stream and Accelerated Decision-Making (ADM) as the execution stream, enabling zero-shot affordance analysis and interpretation of ambiguous instructions. Extensive experiments in simulation and real-world environments show that AIDE achieves a task planning success rate of over 80% and more than 95% accuracy in closed-loop continuous execution at 10 Hz, outperforming existing VLM-based methods in diverse open-world scenarios.

Paper Structure

This paper contains 33 sections, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Method objective. Given an ambiguous instruction and an unfamiliar scene as the input task, the ultimate objective of AIDE is to find the required tool and locate its operational and functional regions within the scene, enabling the robot to operate the tool accurately and complete the task.
  • Figure 2: Framework of AIDE. Given the input task, the Multi-Stage Inference (MSI) stream uses the Multimodal CoT Module (MM-CoT) and the Exploration Policy to generate a keyframe-based task planning result. By scoring the input instruction based on affordance, this result is projected into the Instruction-Tool Relationship Space, which supports cross-modal understanding of the input instruction and provides robustness to hallucinations arising from GPT-5 reasoning. Building on this projection, the Accelerated Decision-Making (ADM) stream employs the Efficient Retrieval Scheme (ERS) and the Exploration Policy for real-time, closed-loop task execution over continuous frames.
  • Figure 3: Affordance vector distributions for instructions and tools after t-SNE projection into three-dimensional space. Taking cutting-, cleaning-, drinking-, and heating-related tasks as representative examples, the affordance vector distributions for instructions and tools largely overlap.
  • Figure 4: Examples of real-world experiment results. Top left and bottom middle: identified handle (blue dot) and body (red dot) regions of the target tool. Bottom left: visible exploration region (blue box). Top middle: invisible exploration region (green box). Top right and bottom right: tool region (red box).
  • Figure 5: The robot used in the real-world experiment has 2 wheels, a 6-DOF arm, a gripper, and a RealSense camera.
  • ...and 5 more figures
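
The captions for Figures 2 and 3 describe scoring an instruction by affordance and matching it against tools in a shared space. A minimal sketch of that idea follows. It is not the authors' implementation: the affordance dimensions, the tool scores, the cosine-similarity metric, the `threshold` value, and the function names `rank_tools` and `choose_or_explore` are all assumptions made for illustration.

```python
import math

# Hypothetical affordance axes (cut, clean, drink, heat), loosely following
# the task families named in the Figure 3 caption. The paper's actual
# affordance space is not given in this summary.
# Illustrative tool scores only; these are NOT data from the paper.
TOOLS = {
    "knife":  (0.9, 0.1, 0.0, 0.0),
    "sponge": (0.0, 0.9, 0.1, 0.0),
    "cup":    (0.0, 0.1, 0.9, 0.2),
    "kettle": (0.0, 0.0, 0.4, 0.9),
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tools(instruction_vec, tools):
    """Return (tool, similarity) pairs sorted from best to worst match."""
    scored = [(name, cosine(instruction_vec, vec)) for name, vec in tools.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def choose_or_explore(instruction_vec, visible_tools, threshold=0.7):
    """Pick the best visible tool, or return None to signal that exploration
    is needed (assumed stand-in for the tool-absent case)."""
    ranking = rank_tools(instruction_vec, visible_tools)
    if not ranking or ranking[0][1] < threshold:
        return None
    return ranking[0][0]


# Assumed affordance scores for the instruction "I'm thirsty".
thirsty = (0.0, 0.0, 1.0, 0.1)

ranking = rank_tools(thirsty, TOOLS)
best_visible = choose_or_explore(thirsty, TOOLS)
# Scene with no drinking-related tool in view: falls back to exploration.
no_drink_scene = {k: v for k, v in TOOLS.items() if k in ("knife", "sponge")}
fallback = choose_or_explore(thirsty, no_drink_scene)
```

In this toy setup the cup ranks first for the thirsty instruction, and a scene containing only the knife and sponge returns `None`, which a caller could treat as the trigger for an exploration step.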