Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation

Yu-Hong Shen, Chuan-Yu Wu, Yi-Ru Yang, Yen-Ling Tai, Yi-Ting Chen

TL;DR

This work tackles robust, instruction-following manipulation by Multimodal Large Language Models (MLLMs) through the QuARC benchmark, which jointly evaluates quantity estimation, reachability, relative positioning, and collision avoidance. It identifies cross-modal distraction and geometric infeasibility as key barriers to reliable closed-loop planning and mitigates them with a Chain-of-Thought approach augmented by Self-Consistency and a skill-affordance predictor that enforces action feasibility without finetuning. The proposed method demonstrates substantial gains, achieving a 76.7% success rate on QuARC and outperforming the ViLa baseline (36.7%) across multiple categories. These results highlight the potential of structured reasoning and feasibility-guided planning for practical, in-context manipulation tasks, while also outlining avenues for future improvements such as richer policies and specialized perception-reasoning hybrids.

Abstract

We investigate the use of Multimodal Large Language Models (MLLMs) with in-context learning for closed-loop task planning in instruction-following manipulation. We identify four essential requirements for successful task planning: quantity estimation, reachability analysis, relative positioning, and collision avoidance. However, existing benchmarks fail to support holistic evaluation across all these aspects. To address this gap, we introduce QuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario that integrates all four challenges. Using QuARC, we reveal two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. To tackle these, we adapt Chain-of-Thought with Self-Consistency to mitigate reasoning loss from cross-modal distractions and incorporate an affordance predictor to guide planning based on geometric feasibility. Our comprehensive evaluation analyzes performance across multiple baselines and explains sources of improvement. Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline (36.7%), without requiring additional finetuning. Code and dataset are available at https://hcis-lab.github.io/Affordance-Guided-Self-Consistent-MLLM.
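
The closed loop described in the abstract (sample skills from the planner, stabilize the choice with Self-Consistency, check geometric feasibility, repeat until DONE) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `sample_skill` and `is_feasible`, the parameters `n_samples` and `max_steps`, and the choice to filter infeasible candidates before voting are all assumptions made for illustration. In a real system `sample_skill` would wrap an MLLM call and `is_feasible` would wrap the affordance predictor.

```python
from collections import Counter
from typing import Callable, List

DONE = "DONE"  # special termination skill named in the paper's pipeline figure


def majority_vote(candidates: List[str]) -> str:
    """Self-Consistency step: return the most frequently sampled skill."""
    return Counter(candidates).most_common(1)[0][0]


def plan(
    sample_skill: Callable[[List[str]], str],       # hypothetical MLLM wrapper
    is_feasible: Callable[[str, List[str]], bool],  # hypothetical affordance check
    n_samples: int = 5,                              # assumed sample count
    max_steps: int = 20,                             # assumed safety bound
) -> List[str]:
    """Build a skill sequence step by step until DONE is selected."""
    history: List[str] = []
    for _ in range(max_steps):
        # Sample several candidate next skills given the skills executed so far.
        candidates = [sample_skill(history) for _ in range(n_samples)]
        # Assumption: drop geometrically infeasible candidates before voting.
        feasible = [s for s in candidates if s == DONE or is_feasible(s, history)]
        if not feasible:
            break  # nothing executable was proposed; stop instead of guessing
        skill = majority_vote(feasible)
        if skill == DONE:
            break
        history.append(skill)
    return history
```

With stub callables in place of the model and the predictor, `plan` returns the list of skills chosen before DONE. Filtering before voting is only one possible ordering; voting first and then rejecting an infeasible winner would be an equally plausible reading of the pipeline description.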

Paper Structure

This paper contains 22 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 2: Overview of our planning pipeline, consisting of the MLLM Planning Stage for generating a skill sequence, Self-Consistency Verification for stabilizing skill selection, and Skill Affordance for verifying geometric feasibility. This process loops until the planner selects the special termination skill DONE.
  • Figure :