More than A Point: Capturing Uncertainty with Adaptive Affordance Heatmaps for Spatial Grounding in Robotic Tasks

Xinyu Shao, Yanzhe Tang, Pengwei Xie, Kaiwen Zhou, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Long Zeng, Xiu Li

TL;DR

RoboMAP addresses brittleness in language-guided spatial grounding by predicting dense, adaptive affordance heatmaps $\hat{M}$ instead of discrete points. It combines a vision-language backbone with a two-branch Adaptive Heatmap Decoder, and uses a procedural heatmap-synthesis pipeline to produce the ground-truth heatmap $M^*$ against which $\hat{M}$ is trained. Key contributions include the adaptive heatmap representation, a unified data-fusion approach via diverse supervision, and strong empirical results: state-of-the-art performance on a majority of grounding benchmarks, an 82% success rate in real-world manipulation, up to a 50x speed improvement, and zero-shot generalization to navigation. This approach yields robust, interpretable grounding with practical real-time applicability across manipulation and navigation tasks.

Abstract

Many language-guided robotic systems rely on collapsing spatial reasoning into discrete points, making them brittle to perceptual noise and semantic ambiguity. To address this challenge, we propose RoboMAP, a framework that represents spatial targets as continuous, adaptive affordance heatmaps. This dense representation captures the uncertainty in spatial grounding and provides richer information for downstream policies, thereby significantly enhancing task success and interpretability. RoboMAP surpasses the previous state-of-the-art on a majority of grounding benchmarks with up to a 50x speed improvement, and achieves an 82% success rate in real-world manipulation. Across extensive simulated and physical experiments, it demonstrates robust performance and shows strong zero-shot generalization to navigation. More details and videos can be found at https://robo-map.github.io.
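
To make the contrast between a single point and a dense heatmap concrete, the following is a minimal sketch, not the paper's synthesis pipeline. It assumes an isotropic Gaussian around a hypothetical target pixel, with a width parameter `sigma` standing in for how ambiguous the instruction is. The function names `gaussian_heatmap`, `normalize`, `argmax`, and `spatial_std` are illustrative and do not come from the RoboMAP code.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Build a height x width grid that peaks at center = (row, col).

    Assumption: an isotropic Gaussian; the paper's procedural synthesis
    of the ground-truth heatmap is not reproduced here.
    """
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def normalize(heatmap):
    """Scale the grid so that all cells sum to 1 (a discrete distribution)."""
    total = sum(sum(row) for row in heatmap)
    return [[v / total for v in row] for row in heatmap]


def argmax(heatmap):
    """Collapse the heatmap to its single most likely (row, col) cell."""
    _, row, col = max(
        ((v, y, x) for y, r in enumerate(heatmap) for x, v in enumerate(r)),
        key=lambda t: t[0],
    )
    return row, col


def spatial_std(prob):
    """Spatial spread of a normalized heatmap, used here as an uncertainty proxy."""
    mean_y = sum(p * y for y, r in enumerate(prob) for p in r)
    mean_x = sum(p * x for r in prob for x, p in enumerate(r))
    var = sum(
        p * ((y - mean_y) ** 2 + (x - mean_x) ** 2)
        for y, r in enumerate(prob)
        for x, p in enumerate(r)
    )
    return math.sqrt(var)


# A precise target versus an ambiguous one around the same pixel.
tight = normalize(gaussian_heatmap(32, 32, (10, 20), 1.5))
wide = normalize(gaussian_heatmap(32, 32, (10, 20), 4.0))
```

Both heatmaps collapse to the same point under `argmax`, but `spatial_std` differs between them. That difference is the information a point-only representation discards and a downstream policy could use.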

Paper Structure

This paper contains 30 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Given an ambiguous instruction like "Find the free space near the teal bowl", prior works (a) that rely on discrete points or bounding boxes fail to capture the complex, non-rectangular nature of the goal region. In contrast, our RoboMAP (b) generates a dense affordance heatmap that accurately represents the entire continuous distribution of suitable locations, successfully grounding the complex spatial concept.
  • Figure 2: Overall Architecture of the RoboMAP Framework.
  • Figure 3: The architecture of the Adaptive Heatmap Decoder (AHD). A PaliGemma backbone processes the image and instruction into visual tokens. These are reshaped into a language-conditioned feature grid $F_{\text{low}}$, which is then fed into our AHD. The AHD uses two branches (AKG and CAP) to compute the final adaptive affordance heatmap.
  • Figure 4: Qualitative comparison on challenging spatial grounding instructions. The visualizations highlight RoboMAP's ability to generate coherent heatmaps for ambiguous regions. This contrasts with the scattered points or diffuse heatmaps produced by baseline methods in the same scenarios.
  • Figure 5: Evaluation of Real-World Robotic Manipulation Tasks.
  • ...and 2 more figures
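
The Figure 3 caption mentions reshaping visual tokens into a feature grid $F_{\text{low}}$. The sketch below illustrates that reshaping step only. It assumes the tokens arrive as a flat, row-major sequence covering a square grid; neither the ordering nor the grid size is stated in this summary. The helper name `tokens_to_grid` is hypothetical.

```python
import math


def tokens_to_grid(tokens):
    """Reshape a flat list of N token vectors into a side x side grid.

    Assumptions (not confirmed by the source): N is a perfect square and
    the tokens are ordered row by row, left to right.
    """
    n = len(tokens)
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError(f"expected a square number of tokens, got {n}")
    # Slice consecutive runs of `side` tokens into rows.
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


# Toy input: 16 tokens, each a 4-dimensional feature vector.
tokens = [[float(i)] * 4 for i in range(16)]
grid = tokens_to_grid(tokens)  # 4 rows of 4 tokens each
```

Under these assumptions, `grid[r][c]` holds the feature vector for the image patch at row `r` and column `c`, which is the spatial layout a convolutional or upsampling decoder would expect.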