Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Qing Zhang, Xuesong Li, Jing Zhang

TL;DR

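Fusing DINO's geometric prototypes with Flux's verb-conditioned interaction maps, in a training-free and zero-shot manner, yields affordance estimation competitive with weakly-supervised methods. This supports the claim that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs.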

Abstract

What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.
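The fusion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual fusion rule, so everything below is an assumption made for illustration: the geometric cue is taken to be a cosine-similarity map between patch features and a single part prototype, the interaction cue is a 2D attention map, and the two are combined by an element-wise product of min-max normalized maps. The function names (`minmax`, `cosine_to_prototype`, `fuse`) and the random toy data are hypothetical and stand in for real DINO features and Flux attention.

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Rescale an array to [0, 1]; a constant array maps to all zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)

def cosine_to_prototype(feats: np.ndarray, proto: np.ndarray) -> np.ndarray:
    """Cosine similarity between patch features (H, W, D) and a prototype (D,)."""
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    p = proto / (np.linalg.norm(proto) + 1e-8)
    return f @ p  # shape (H, W)

def fuse(geo_sim: np.ndarray, interaction: np.ndarray) -> np.ndarray:
    """Assumed fusion rule (not from the paper): product of normalized maps."""
    return minmax(minmax(geo_sim) * minmax(interaction))

# Toy stand-ins: random "patch features" and a hand-made "attention map".
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
feats = rng.normal(size=(H, W, D))
proto = feats[1, 2]              # pretend patch (1, 2) is the part prototype
attn = np.zeros((H, W))
attn[1, 2] = 1.0                 # interaction cue peaks on the same patch
attn[0, 0] = 0.5                 # weaker response elsewhere

affordance = fuse(cosine_to_prototype(feats, proto), attn)
peak = np.unravel_index(affordance.argmax(), affordance.shape)
print("peak patch:", peak)
```

A product only keeps regions where both cues respond, which matches the "composable" reading of the abstract; a weighted sum would be an equally plausible choice under different assumptions.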

Paper Structure (10 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: From geometry to interaction: uncovering affordance perception in visual foundation models. Affordance understanding emerges from the fusion of geometric structure and generative interaction priors. (a) Models with stronger geometric awareness yield richer part-level representations. (b) Generative models reveal verb-conditioned attention that naturally localizes interaction regions without supervision.
  • Figure 2: Learning Affordances: From Geometry to Interaction. Fully-supervised methods learn geometric structures from pixel masks; weakly-supervised ones infer interactions from human-object imagery.
  • Figure 3: Distinct geometric representations across Visual Foundation Models. We project complex scenes into the PCA subspace of a reference object (a mug) to visualize and compare geometric representations. DINOv3 produces part-level structures that generalize across scenes and materials, as confirmed by cosine similarity responses from a mug-handle patch, while CLIP, SAM, and Stable Diffusion emphasize semantics, edges, or smooth surface continuity instead.
  • Figure 4: Geometric awareness predicts affordance perception ability. We evaluate visual models under different supervision paradigms using linear probing on the UMD dataset [myers2015affordance, do2018affordancenet]. Models with stronger geometric awareness [el2024probing] achieve higher mIoU in affordance segmentation, while others improve notably when augmented with depth or normal cues.
  • Figure 5: Stable part-level structure in DINO representations. PCA of DINOv3 embeddings reveals consistent part-wise activations across the first three components, indicating interpretable geometric decomposition aligned with object functionality.
  • ...and 7 more figures
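
Two operations named in the captions above, projecting scene features into the PCA subspace of a reference object (Figures 3 and 5) and scoring segmentation by mIoU (Figure 4), can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' code: features are assumed to be flattened patch embeddings of shape (N, D), PCA is fit with a plain SVD, and the helper names (`fit_pca`, `project`, `mean_iou`) and random inputs are hypothetical.

```python
import numpy as np

def fit_pca(ref_feats: np.ndarray, k: int = 3):
    """Fit PCA on reference patch features (N, D); return mean (D,) and top-k axes (k, D)."""
    mean = ref_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(ref_feats - mean, full_matrices=False)
    return mean, vt[:k]

def project(feats: np.ndarray, mean: np.ndarray, axes: np.ndarray) -> np.ndarray:
    """Project features (M, D) into the reference subspace, giving (M, k)."""
    return (feats - mean) @ axes.T

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy stand-ins for reference-object and scene patch embeddings.
rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 16))      # e.g. patches of a reference mug
scene = rng.normal(size=(200, 16))   # patches of a different scene
mean, axes = fit_pca(ref, k=3)
coords = project(scene, mean, axes)  # (200, 3), viewable as RGB after rescaling

# Tiny segmentation example with two classes.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)
print(coords.shape, round(score, 4))
```

Fitting the axes on the reference object and reusing them for other scenes is what lets the same component index be compared across images; fitting PCA per image would not give that correspondence.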