More than A Point: Capturing Uncertainty with Adaptive Affordance Heatmaps for Spatial Grounding in Robotic Tasks
Xinyu Shao, Yanzhe Tang, Pengwei Xie, Kaiwen Zhou, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Long Zeng, Xiu Li
TL;DR
RoboMAP addresses brittleness in language-guided spatial grounding by predicting dense, adaptive affordance heatmaps $\hat{M}$ instead of discrete points. It combines a vision-language backbone with a two-branch Adaptive Heatmap Decoder, and uses a procedural heatmap-synthesis pipeline to produce the ground-truth heatmaps $M^*$ against which $\hat{M}$ is trained. Key contributions are the adaptive heatmap representation, a unified data-fusion approach that draws on diverse supervision, and strong empirical results: state-of-the-art performance on a majority of grounding benchmarks with up to a 50x speed improvement, an 82% real-world manipulation success rate, and zero-shot generalization to navigation. This approach yields robust, interpretable grounding with practical real-time applicability across manipulation and navigation tasks.
Abstract
Many language-guided robotic systems rely on collapsing spatial reasoning into discrete points, making them brittle to perceptual noise and semantic ambiguity. To address this challenge, we propose RoboMAP, a framework that represents spatial targets as continuous, adaptive affordance heatmaps. This dense representation captures the uncertainty in spatial grounding and provides richer information for downstream policies, thereby significantly enhancing task success and interpretability. RoboMAP surpasses the previous state-of-the-art on a majority of grounding benchmarks with up to a 50x speed improvement, and achieves an 82% success rate in real-world manipulation. Across extensive simulated and physical experiments, it demonstrates robust performance and shows strong zero-shot generalization to navigation. More details and videos can be found at https://robo-map.github.io.
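
To make the contrast between a discrete point and a dense heatmap concrete, the following is a minimal illustrative sketch, not RoboMAP's actual synthesis pipeline or decoder. It assumes an isotropic Gaussian as a stand-in for a target heatmap; the function names `gaussian_heatmap` and `summarize`, the image size, and the `sigma` values are all hypothetical. Two heatmaps can share the same peak pixel, which is all a point prediction would report, while differing in spatial spread, which is the extra uncertainty information a downstream policy could use.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma):
    """Hypothetical stand-in for a target heatmap: an isotropic Gaussian
    centered at center_xy = (x, y), normalized to sum to 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    return heat / heat.sum()

def summarize(heat):
    """Return the peak pixel (x, y) and the spatial standard deviation
    (in pixels) of a normalized heatmap."""
    h, w = heat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    peak_y, peak_x = np.unravel_index(np.argmax(heat), heat.shape)
    mean_x = (heat * xs).sum()
    mean_y = (heat * ys).sum()
    var = (heat * ((xs - mean_x) ** 2 + (ys - mean_y) ** 2)).sum()
    return (int(peak_x), int(peak_y)), float(np.sqrt(var))

# Assumed sigmas: a small one for a precise target, a large one for an ambiguous region.
tight = gaussian_heatmap(64, 64, center_xy=(20, 30), sigma=2.0)
broad = gaussian_heatmap(64, 64, center_xy=(20, 30), sigma=8.0)

peak_tight, spread_tight = summarize(tight)
peak_broad, spread_broad = summarize(broad)

print("tight:", peak_tight, round(spread_tight, 2))
print("broad:", peak_broad, round(spread_broad, 2))
```

Both heatmaps reduce to the same argmax point, so a point-only representation cannot distinguish them; the spread statistic is one simple way to read out how confident the spatial target is.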
