IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments

Can Zhang, Gim Hee Lee

TL;DR

IAAO tackles interactive affordance learning for articulated objects in 3D environments by building an explicit 3D Gaussian Splatting representation augmented with hierarchical semantic features from foundation models. It combines semantic scene reconstruction, language-guided affordance localization, and global/local motion estimation with robust 2D-3D correspondences, followed by scene state fusion to integrate two articulated configurations. The method achieves state-of-the-art performance on PARIS and multi-part benchmarks, with strong generalization to unseen objects and complex indoor scenes, while supporting manipulation through affordance-aware queries. This approach enables robust interaction and manipulation in real-world environments without relying on category-specific priors or perfectly aligned camera poses, significantly advancing interactive perception for robots and AR/VR agents.

Abstract

This work presents IAAO, a novel framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. Unlike prior methods that rely on task-specific networks and assumptions about movable parts, our IAAO leverages large foundation models to estimate interactive affordances and part articulations in three stages. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances. Finally, scenes from different states are merged and refined based on the estimated transformations, enabling robust affordance-based interaction and manipulation of objects. Experimental results demonstrate the effectiveness of our method.
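
The abstract refers to "local articulation parameters" without spelling them out. As a rough illustration only, suppose a movable part is related across the two states by a pure rotation, as a hinged door would be. The joint's axis direction and opening angle can then be read off the rotation matrix with the standard axis-angle formula. The function name `revolute_from_rotation` and the sample matrix are hypothetical, and this closed-form step is not the optimization IAAO itself performs.

```python
import math

def revolute_from_rotation(R):
    """Return (axis, angle_rad) for a 3x3 rotation matrix given as nested lists.

    Assumes 0 < angle < pi; the degenerate cases (angle near 0 or pi)
    are rejected rather than handled in this sketch.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle = math.acos(cos_angle)
    s = 2.0 * math.sin(angle)
    if abs(s) < 1e-9:
        raise ValueError("rotation angle too close to 0 or pi for this sketch")
    # The skew-symmetric part of R encodes the axis direction.
    axis = (
        (R[2][1] - R[1][2]) / s,
        (R[0][2] - R[2][0]) / s,
        (R[1][0] - R[0][1]) / s,
    )
    return axis, angle

# Hypothetical example: a door rotated 90 degrees about the vertical (z) axis.
R_door = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
axis, angle = revolute_from_rotation(R_door)
```

A prismatic joint such as a drawer would instead be described by a translation direction and distance, which this sketch does not cover.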

Paper Structure

This paper contains 17 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Our IAAO requires multi-view images of the object or indoor scene from two different joint states (Left). The output is a 3D interactive field which supports interactions with multiple movable parts for fine-grained segmentation (e.g. Case 2: handles) and articulation reconstruction (e.g. Case 1: two articulated doors).
  • Figure 2: Our IAAO framework. 1) Top: Constructing 3D Gaussian fields in each state. We optimize 3DGS fields with hierarchical mask features, DINOv2 features and 3D-consistent mask labels generated from multi-view images. We also incorporate geometry information from depth images into the 3D Gaussians. 2) Bottom: Affordance and motion prediction. A query prompt is embedded using a pretrained encoder to localize relevant regions in the 3D Gaussians. For motion prediction, we optimize the transformation parameters by applying consistency and matching losses to 2D-3D correspondences between states. 3) Right: Scene fusion. Using the estimated transformations, we merge reconstructed 3DGS models from both states, aligning static and articulated elements.
  • Figure 3: Qualitative analysis of shape reconstruction, part segmentation, and joint prediction results on the multi-part object dataset.
  • Figure 4: Qualitative results of shape reconstruction, part segmentation, and joint prediction on PARIS.
  • Figure 5: Motion snapshots on PARIS and multi-part objects.
  • ...and 7 more figures
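
The Figure 2 caption describes embedding a query prompt and using it to localize relevant regions in the 3D Gaussians. One simple way to picture that step, assuming each Gaussian carries a feature vector in the same space as the prompt embedding, is to threshold cosine similarity. The names `cosine` and `localize`, the threshold value, and the toy three-dimensional features below are all assumptions made for illustration; the paper's actual scoring rule is not given in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def localize(query_embedding, gaussian_features, threshold=0.8):
    """Return indices of Gaussians whose feature is similar enough to the query."""
    return [
        i
        for i, feature in enumerate(gaussian_features)
        if cosine(query_embedding, feature) >= threshold
    ]

# Toy data: four Gaussians with hypothetical 3-D features.
gaussian_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]  # stand-in for an embedded prompt such as "handle"
selected = localize(query, gaussian_features)
```

In this toy setup the first two Gaussians pass the threshold, so they would form the region returned for the query.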