Panoramic Affordance Prediction

Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow fields of view (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 x 5952) panoramic images with over 12K carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images degrade severely under the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
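
The coarse-to-fine idea behind recursive visual routing can be illustrated with a minimal sketch. This is not the authors' implementation: the 2 x 2 grid, the `min_side` stopping threshold, and the names `split_grid`, `route`, and `pick_cell_with_point` are all hypothetical, and the `choose_cell` callable merely stands in for whatever model answers the grid prompt. The sketch only shows how repeatedly selecting one grid cell shrinks a 11904 x 5952 panorama to a small region of interest.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_grid(box: Box, rows: int, cols: int) -> List[Box]:
    """Split a box into rows x cols cells, listed in row-major order."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    return [
        (
            left + c * w // cols,
            top + r * h // rows,
            left + (c + 1) * w // cols,
            top + (r + 1) * h // rows,
        )
        for r in range(rows)
        for c in range(cols)
    ]


def route(
    box: Box,
    choose_cell: Callable[[List[Box]], int],
    rows: int = 2,
    cols: int = 2,
    min_side: int = 1024,
) -> Box:
    """Zoom into the chosen cell until the shorter side is <= min_side.

    choose_cell is a hypothetical stand-in for the grid-prompted selector:
    it receives the candidate cells and returns the index of one of them.
    """
    while min(box[2] - box[0], box[3] - box[1]) > min_side:
        cells = split_grid(box, rows, cols)
        box = cells[choose_cell(cells)]
    return box


def pick_cell_with_point(x: int, y: int) -> Callable[[List[Box]], int]:
    """Toy selector for demonstration: picks the cell containing (x, y)."""
    def choose(cells: List[Box]) -> int:
        for i, (l, t, r, b) in enumerate(cells):
            if l <= x < r and t <= y < b:
                return i
        raise ValueError("point lies outside every cell")
    return choose


if __name__ == "__main__":
    full = (0, 0, 11904, 5952)  # panorama size quoted in the abstract
    print(route(full, pick_cell_with_point(9000, 1000)))
```

A target that straddles a cell border is not handled here; a practical router would need overlapping cells or some padding around the selected region.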

Paper Structure (57 sections, 5 equations, 39 figures, 11 tables)

Figures (39)

  • Figure 1: Overview of our work. Left: We introduce PAP-12K, the first large-scale benchmark dedicated to panoramic affordance prediction, featuring ultra-high resolution (12K) imagery, rich reasoning-based QA pairs, and explicitly capturing unique panoramic challenges (geometric distortion, boundary discontinuity, and extreme scale variations). Right: We propose the PAP framework, which mimics human foveal vision to tackle these challenges. It employs Recursive Visual Routing for efficient coarse localization, an Adaptive Gaze mechanism to rectify spatial distortions, and Cascaded Affordance Grounding for precise instance-level mask extraction.
  • Figure 2: PAP-12K specifically features three challenges inherent to $360^{\circ}$ panoramic imagery and ERP: (1) Geometric Distortion (e.g., the bed and the elevator); (2) Extreme Scale Variations (e.g., the extremely small security camera and the extremely large curtain); (3) Boundary Discontinuity (e.g., the drying rod and the fire hose).
  • Figure 3: Statistics of PAP-12K. (Left) Scene distribution with 1,003 high-resolution panoramic images; (Middle) Object distribution featuring 6,103 annotated object instances; (Right) Question distribution comprising 13,493 affordance questions.
  • Figure 4: Word Cloud of Questions in PAP-12K.
  • Figure 5: Illustration of the PAP framework.
  • ...and 34 more figures
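
Two of the challenges named in the Figure 2 caption, geometric distortion and boundary discontinuity, follow directly from the standard equirectangular projection (ERP) geometry. The sketch below is an illustration under assumed conventions (longitude 0 at the image center, latitude +90° at the top row, pixel centers at half-integer offsets); the function names are hypothetical and none of this is taken from the paper's code.

```python
import math

WIDTH, HEIGHT = 11904, 5952  # PAP-12K image size quoted in the abstract


def pixel_to_lonlat(u: float, v: float, width: int = WIDTH, height: int = HEIGHT):
    """Map an ERP pixel (column u, row v) to (longitude, latitude) in radians."""
    lon = (u + 0.5) / width * 2 * math.pi - math.pi   # range [-pi, pi)
    lat = math.pi / 2 - (v + 0.5) / height * math.pi  # +pi/2 at the top row
    return lon, lat


def horizontal_stretch(v: float, height: int = HEIGHT) -> float:
    """Factor by which ERP widens content at row v relative to the equator.

    Each row spans a full 360 degrees, but the circle of latitude it
    represents shrinks by cos(lat), so content is stretched by 1 / cos(lat).
    """
    _, lat = pixel_to_lonlat(0, v, height=height)
    return 1.0 / math.cos(lat)


def wrapped_pixel_distance(u1: int, u2: int, width: int = WIDTH) -> int:
    """Horizontal pixel distance that accounts for the left/right seam."""
    d = abs(u1 - u2) % width
    return min(d, width - d)


if __name__ == "__main__":
    print(horizontal_stretch(HEIGHT // 2))         # close to 1 at the equator
    print(horizontal_stretch(992))                 # close to 2 near 60 degrees latitude
    print(wrapped_pixel_distance(0, WIDTH - 1))    # adjacent across the seam
```

Under these conventions, columns 0 and 11903 are neighbors on the sphere even though they sit at opposite image edges, which is why an object crossing the seam appears split in the flat image.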