Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation
Tzu-Jung Lin, Jia-Fong Yeh, Hung-Ting Su, Chung-Yi Lin, Yi-Ting Chen, Winston H. Hsu
TL;DR
In open-vocabulary mobile manipulation (OVMM), task success hinges on selecting a base placement that balances task semantics with geometric feasibility under limited perception. The authors present Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework that fuses vision-language model (VLM) cues with geometric constraints through two cross-modal representations (Affordance RGB and Obstacle Map+) and an iterative optimization that is semantic-first, then geometry-driven. A sigmoid schedule sets the weighting between semantic and geometric terms, and VLM-based ranking steers the coarse-to-fine search toward executable, semantically aligned placements. In simulation across five tasks, the method achieves an 85% success rate, outperforming classical geometric planners and purely semantic baselines, which demonstrates the value of affordance-aware, multimodal reasoning for instruction-conditioned base placement in OVMM.
Abstract
In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To keep interaction grounded in task-relevant affordances, we use coarse semantic priors from VLMs to steer the search toward promising regions and then refine placements with geometric constraints, which reduces the risk of converging to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.
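
The semantic-first, then geometry-driven weighting mentioned in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation: the logistic form, the `steepness` parameter, the linear blend, and the names `geometry_weight`, `blended_score`, and `pick_candidate` are all assumptions chosen for illustration, and the candidate scores are made-up numbers standing in for VLM and geometric evaluations.

```python
import math


def geometry_weight(step, total_steps, steepness=10.0):
    """Hypothetical sigmoid schedule for the geometric term.

    Returns a weight near 0 in early iterations (semantic-first) and
    near 1 in late iterations (geometry-driven). The logistic shape and
    the steepness value are assumptions, not taken from the paper.
    """
    progress = step / max(total_steps - 1, 1)  # normalized to [0, 1]
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


def blended_score(semantic, geometric, step, total_steps):
    """Linear blend of a semantic score and a geometric score (assumed form)."""
    w = geometry_weight(step, total_steps)
    return (1.0 - w) * semantic + w * geometric


def pick_candidate(candidates, step, total_steps):
    """Return the name of the best candidate at this iteration.

    `candidates` is a list of (name, semantic_score, geometric_score)
    tuples with scores in [0, 1]; the values are illustrative only.
    """
    best = max(
        candidates,
        key=lambda c: blended_score(c[1], c[2], step, total_steps),
    )
    return best[0]


# Illustrative candidates: one semantically strong, one geometrically strong.
candidates = [
    ("near_handle", 0.9, 0.2),
    ("clear_floor", 0.4, 0.95),
]
early_choice = pick_candidate(candidates, step=0, total_steps=5)
late_choice = pick_candidate(candidates, step=4, total_steps=5)
```

Under these assumed scores, the early iteration favors the semantically strong candidate and the final iteration favors the geometrically feasible one, which is the qualitative behavior a semantic-first, geometry-driven schedule is meant to produce.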
