INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Ji Ha Jang, Hoigi Seo, Se Young Chun

TL;DR

INTRA reframes weakly supervised affordance grounding as representation learning, enabling grounding from exocentric images alone while leveraging vision-language models and large language models to capture interaction relationships. The method introduces a text-conditioned affordance map that guides contrastive learning through an LLM-derived interaction-relationship map and text synonym augmentation, achieving state-of-the-art performance on AGD20K and strong generalization to IIT-AFF, CAD, and UMD. Key innovations include the interaction relationship-guided contrastive loss and object-variance mitigation loss, which together ground multiple affordances for a single object without paired exocentric-egocentric data. The approach demonstrates robustness across domain gaps and novel interactions, highlighting practical impact for scalable, zero-shot affordance grounding and flexible text-driven inference in real-world settings.

Abstract

Affordance denotes the potential interactions inherent in objects. The perception of affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance from exocentric images instead of costly pixel-level annotations. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the difficulty of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior work, INTRA recasts this problem as representation learning to identify unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationships for contrastive learning and enhancing robustness with our text synonym augmentation. Our method outperformed prior work on diverse datasets such as AGD20K, IIT-AFF, CAD and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability for synthesized images and illustrations and is capable of performing affordance grounding for novel interactions and objects.
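The abstract describes contrastive learning on exocentric features in which interaction relationships decide which samples count as positives. The following is a minimal sketch of that idea, not the paper's exact loss: it assumes a supervised-contrastive-style objective, a boolean relationship matrix `R` indexed by interaction label, and a hypothetical `temperature` parameter. The function name and signature are illustrative only.

```python
import numpy as np


def relationship_guided_contrastive_loss(features, labels, R, temperature=0.1):
    """Illustrative contrastive loss (an assumption, not the paper's formula).

    features: (N, D) float array of per-image features, N >= 2.
    labels:   (N,) int array of interaction indices.
    R:        (K, K) bool array; R[a, b] is True if interactions a and b
              are treated as a positive pair.
    """
    # Cosine similarity between all pairs of L2-normalized features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = f @ f.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Sample i and sample j are positives if their interactions are related.
    positive = R[labels][:, labels] & not_self

    # Log-softmax over all other samples (self-similarity is masked out).
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average negative log-probability of the positives for each anchor.
    pos_count = positive.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(positive, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / pos_count[valid]).mean())
```

Under these assumptions, features of related interactions are pulled together and all others are pushed apart; anchors with no positive in the batch are simply skipped.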

Paper Structure (73 sections, 9 equations, 19 figures, 15 tables)

This paper contains 73 sections, 9 equations, 19 figures, 15 tables.

Figures (19)

  • Figure 1: Prior work on weakly supervised affordance grounding, such as LOCATE [li2023locate], often failed to ground different affordances for the same object. In contrast, our proposed INTRA yielded finer and more accurate grounding results, closer to the ground truth (GT), by considering the interaction relationships among them.
  • Figure 2: Overall frameworks of (a) LOCATE [li2023locate] and (b) INTRA (Ours). LOCATE takes paired exocentric and egocentric images to generate interaction-aware affordance maps (CAMs) for predefined interactions and then selects an interaction-related CAM by the given interaction label. In contrast, INTRA takes only exocentric images and interaction labels to yield an affordance map through our affordance map generation module. Training is done via interaction relationship-guided contrastive learning on exocentric features from affordance maps. Note that all encoder parameters are frozen.
  • Figure 3: The overall scheme of interaction-relationship map ($\mathcal{R}$) generation. An LLM classifies all pairs of interactions in the dataset as positive or negative through chain-of-thought reasoning. The classification is based on whether the two interactions occur on the same object parts.
  • Figure 4: Qualitative results of INTRA (Ours) and baseline models [luo2022grounded, luo2022learning, li2023locate] on grounding affordances of multiple potential interactions on a single object. INTRA precisely localizes relevant interaction spots for each interaction. For example, with a knife, it grounds the handle for 'Hold' and the blade for 'Cut with'. For a motorcycle, it accurately grounds the saddle for 'Sit on'. Additionally, for 'Ride', it grounds both the handle and saddle, slightly deviating from the GT but still producing reasonable results, as one usually interacts with both the handle and the saddle to 'Ride' a motorcycle.
  • Figure 6: An illustration of interaction relationship-guided contrastive learning and t-SNE [van2008visualizing] visualization of feature distribution. (a) In interaction relationship-guided contrastive learning, positive interaction pairs attract each other, while others repel. (b) t-SNE visualization of the DINOv2 [oquab2023dinov2] class token and $f_{exo}$ from INTRA, showing that features of positive interaction pairs become closer as learning progresses.
  • ...and 14 more figures
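
The Figure 3 caption describes building the interaction-relationship map $\mathcal{R}$ by classifying every pair of interactions as positive or negative. A minimal sketch of that bookkeeping is shown below. The LLM call is replaced by a hypothetical `is_positive_pair` callback, the toy part table is hand-written for illustration rather than taken from the paper, and treating each interaction as positive with itself is an assumption.

```python
from itertools import combinations

import numpy as np


def build_relationship_map(interactions, is_positive_pair):
    """Build a symmetric boolean map R from pairwise judgements.

    interactions:     list of interaction names.
    is_positive_pair: hypothetical callback standing in for the LLM's
                      chain-of-thought classification of a pair.
    """
    k = len(interactions)
    # Assumption: every interaction is positive with itself.
    R = np.eye(k, dtype=bool)
    for i, j in combinations(range(k), 2):
        verdict = bool(is_positive_pair(interactions[i], interactions[j]))
        R[i, j] = R[j, i] = verdict
    return R


# Toy stand-in for the LLM (illustrative data only): two interactions are
# positive if they act on the same object part.
TOY_PART = {
    "hold": "handle",
    "grasp": "handle",
    "cut with": "blade",
    "sit on": "saddle",
}


def toy_judge(a, b):
    return TOY_PART[a] == TOY_PART[b]


if __name__ == "__main__":
    names = list(TOY_PART)
    print(build_relationship_map(names, toy_judge).astype(int))
```

Querying each unordered pair once and mirroring the verdict keeps the map symmetric and halves the number of judgements needed.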