Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation

Tzu-Jung Lin, Jia-Fong Yeh, Hung-Ting Su, Chung-Yi Lin, Yi-Ting Chen, Winston H. Hsu

TL;DR

Open-vocabulary mobile manipulation (OVMM) tasks hinge on selecting a base placement that balances task semantics with geometric feasibility under limited perception. The authors present Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework that fuses vision-language model (VLM) cues with geometric constraints via cross-modal representations (Affordance RGB and Obstacle Map+) and an iterative optimization that starts semantics-first and becomes geometry-driven. A coarse-to-fine procedure, governed by a sigmoid schedule for semantic-geometry weighting and VLM-based ranking, guides the search toward executable, semantically aligned placements. In simulation across five tasks, the method achieves an 85% success rate, outperforming classical geometric planners and purely semantic baselines, which demonstrates the value of affordance-aware, multimodal reasoning for instruction-conditioned base placement in OVMM.
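To make the "sigmoid schedule for semantic-geometry weighting" concrete, the following is a minimal sketch of one way such a schedule could look. It is not the authors' formulation: the function names, the `steepness` parameter, the midpoint at half of the iterations, and the convex combination of the two scores are all assumptions chosen for illustration.

```python
import math


def geometry_weight(iteration: int, total_iters: int, steepness: float = 10.0) -> float:
    """Hypothetical sigmoid schedule for the geometric weight.

    Returns a value close to 0 in early iterations (semantics dominate)
    and close to 1 in late iterations (geometry dominates).
    """
    # Normalize the iteration index to [0, 1]; guard against total_iters == 1.
    progress = iteration / max(total_iters - 1, 1)
    # Logistic curve centered at the midpoint of the run (an assumption).
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


def combined_score(semantic: float, geometric: float, w_geo: float) -> float:
    """Blend a semantic score and a geometric score with weight w_geo in [0, 1]."""
    return (1.0 - w_geo) * semantic + w_geo * geometric
```

With this shape, early candidates are ranked almost entirely by semantic score, and the ranking shifts smoothly toward geometric feasibility as the iterations proceed.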

Abstract

In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to guide the search toward task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.

Paper Structure

This paper contains 30 sections, 9 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Examples of failure cases caused by base placements without affordance awareness. Left: The robot cannot open the cabinet because it is not facing the drawer. Middle: The robot cannot grasp the pot handle due to misalignment. Right: The robot fails to place the mug on the shelf as it is not facing the open side. These failures arise from a lack of joint reasoning over task intent and geometric feasibility, leading to semantically misaligned placements that prevent successful manipulation.
  • Figure 2: Affordance-Guided Coarse-to-Fine Exploration. The method comprises two key components. (1) To overcome the limitations of single-view perception, it applies Affordance Guidance Projection, which uses semantic cues to generate Affordance RGB and Obstacle Map+ from RGB and obstacle maps, enabling global semantic reasoning. (2) To identify base placements that satisfy both semantic relevance and geometric feasibility, it introduces Affordance-Driven Coarse-to-Fine Optimization, which leverages the coarse, high-level nature of VLM outputs to explore semantically appropriate regions. As the process iterates, geometric constraints are gradually emphasized, guiding the search toward executable base placements.
  • Figure 3: Base placement distribution evolution for the task "Open the cabinet." The A*/RRT* baseline (top row) selects a base placement at an oblique angle in front of the cabinet, which is not ideal. The Pivot baseline (second row) selects a region in front of the cabinet but fails due to excessive distance from the target. Our method (bottom row) converges to a base placement that is both feasible and semantically appropriate.
  • Figure 4: Projection Module Ablation. The full method achieves 85% success. Removing the 12 arrows causes a small drop (80%), while removing the main arrow “A” leads to a larger drop (62%). Without projection, performance drops most significantly to 48%.
  • Figure 5: Affordance guidance modules. (a) Coarse affordance direction selection for the Affordance Guidance Projection. (b) Affordance point selection for the Affordance-Driven Coarse-to-Fine Optimization.
  • ...and 7 more figures
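
The Figure 2 caption describes an iterative search that first explores semantically appropriate regions and then gradually emphasizes geometric constraints. As a rough illustration only, the sketch below implements a generic sample-score-refit loop over 2D base positions. Everything in it is an assumption: the scoring callables stand in for the VLM-based ranking and the geometric feasibility check, and the Gaussian sampling, elite selection, shrink factor, and default weight list are hypothetical choices rather than the paper's algorithm.

```python
import math
import random
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def coarse_to_fine_search(
    semantic_score: Callable[[Point], float],   # hypothetical stand-in for VLM ranking
    geometric_score: Callable[[Point], float],  # hypothetical stand-in for feasibility
    start: Point,
    geo_weights: Sequence[float] = (0.0, 0.1, 0.3, 0.7, 0.9, 1.0),
    n_samples: int = 64,
    top_k: int = 8,
    init_spread: float = 1.0,
    seed: int = 0,
) -> Point:
    """Sample candidates, keep the best few, and narrow the search each round."""
    rng = random.Random(seed)
    mean_x, mean_y = start
    spread = init_spread
    for w_geo in geo_weights:
        # Draw candidate base positions around the current estimate.
        candidates = [
            (rng.gauss(mean_x, spread), rng.gauss(mean_y, spread))
            for _ in range(n_samples)
        ]
        # Early rounds favor semantics; later rounds favor geometry.
        candidates.sort(
            key=lambda p: (1.0 - w_geo) * semantic_score(p) + w_geo * geometric_score(p),
            reverse=True,
        )
        elite = candidates[:top_k]
        # Re-center on the elite set and shrink the sampling radius.
        mean_x = sum(p[0] for p in elite) / top_k
        mean_y = sum(p[1] for p in elite) / top_k
        spread = max(spread * 0.7, 0.05)
    return (mean_x, mean_y)


if __name__ == "__main__":
    goal = (2.0, 0.0)  # toy target: a spot in front of the object
    near_goal = lambda p: -math.dist(p, goal)
    print(coarse_to_fine_search(near_goal, near_goal, start=(0.0, 0.0)))
```

In a real system the two callables would be far more expensive than these toy distance functions, so the number of samples and rounds would need to be budgeted accordingly.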