More than A Point: Capturing Uncertainty with Adaptive Affordance Heatmaps for Spatial Grounding in Robotic Tasks

Xinyu Shao; Yanzhe Tang; Pengwei Xie; Kaiwen Zhou; Yuzheng Zhuang; Xingyue Quan; Jianye Hao; Long Zeng; Xiu Li

TL;DR

RoboMAP addresses brittleness in language-guided spatial grounding by predicting dense, adaptive affordance heatmaps $\hat{M}$ instead of discrete points. It combines a vision-language backbone with a two-branch Adaptive Heatmap Decoder, and uses a procedural heatmap-synthesis pipeline to produce the ground-truth heatmaps $M^*$ against which $\hat{M}$ is trained. Key contributions include the adaptive heatmap representation, a unified data-fusion approach via diverse supervision, and strong empirical results: state-of-the-art on a majority of grounding benchmarks, an 82% success rate in real-world manipulation, up to a 50x speed improvement, and zero-shot generalization to navigation. The approach yields robust, interpretable grounding that is fast enough for practical use across manipulation and navigation tasks.

Abstract

Many language-guided robotic systems rely on collapsing spatial reasoning into discrete points, making them brittle to perceptual noise and semantic ambiguity. To address this challenge, we propose RoboMAP, a framework that represents spatial targets as continuous, adaptive affordance heatmaps. This dense representation captures the uncertainty in spatial grounding and provides richer information for downstream policies, thereby significantly enhancing task success and interpretability. RoboMAP surpasses the previous state-of-the-art on a majority of grounding benchmarks with up to a 50x speed improvement, and achieves an 82% success rate in real-world manipulation. Across extensive simulated and physical experiments, it demonstrates robust performance and shows strong zero-shot generalization to navigation. More details and videos can be found at https://robo-map.github.io.
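
To make the contrast between a single point and a dense heatmap concrete, here is a minimal sketch in plain Python. It is not RoboMAP's decoder or its heatmap-synthesis pipeline, neither of which is detailed here. The isotropic Gaussian form, the `sigma` spread parameter, and the function names `gaussian_heatmap`, `peak`, and `region_above` are all assumptions chosen for illustration.

```python
import math

def gaussian_heatmap(height, width, center, sigma):
    """Build a height x width grid (list of rows) peaking at center = (row, col).

    sigma is a hypothetical spread parameter: small for a precise target,
    large for an ambiguous one. This is an assumed form, not the paper's.
    """
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def peak(heatmap):
    """Collapse the heatmap to its single highest-scoring pixel (row, col)."""
    best = max(
        ((value, (y, x))
         for y, row in enumerate(heatmap)
         for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1]

def region_above(heatmap, threshold):
    """Return every pixel whose score is at least threshold."""
    return [
        (y, x)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
        if value >= threshold
    ]

# Two illustrative targets with the same center but different spread.
narrow = gaussian_heatmap(32, 32, center=(10, 20), sigma=1.0)
wide = gaussian_heatmap(32, 32, center=(10, 20), sigma=4.0)

# Both maps collapse to the same point, yet the acceptable area differs.
print(peak(narrow), len(region_above(narrow, 0.5)))
print(peak(wide), len(region_above(wide, 0.5)))
```

Both maps reduce to the same argmax pixel, so a point-only output cannot tell them apart. The size of the above-threshold region is what a downstream policy could use to distinguish a tight target from a permissive one.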
