Φeat: Physically-Grounded Feature Representation

Giuseppe Vecchio, Adrien Kaiser, Rouffet Romain, Rosalie Martin, Elena Garces, Tamy Boubekeur

TL;DR

Φeat introduces a physically-grounded self-supervised backbone that learns material-aware features by replacing conventional photometric augmentations with renderings of the same material under diverse geometry and lighting. Built on a ViT backbone and a suite of losses (image-level DINO-like alignment, patch-level latent reconstruction, KoLeo dispersion, Gram anchoring, and in-batch contrastive learning), it encourages invariance to extrinsic factors while preserving intrinsic material cues. Quantitative and qualitative results demonstrate superior material discrimination, robust clustering by material identity under illumination and geometry changes, and coherent patch-level segmentations that reflect reflectance and texture rather than semantic object parts. The work highlights the potential of unsupervised physical feature learning to support physics-aware perception tasks in vision and graphics, offering a scalable path beyond supervised material annotation.

Abstract

Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.

Paper Structure

This paper contains 25 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Dataset samples. Each row shows a different template, first with no materials applied (first column) and then with different materials (following columns). Materials are rendered with transparency on objects matching their semantic categories, with randomized object and lighting rotation.
  • Figure 2: Φeat training pipeline. Two renderings of the same material are sampled and augmented with a multi-crop strategy that yields global and local views. The student processes all crops (globals and locals), with random masking on patch tokens for latent reconstruction; the teacher and the Gram teacher process global crops only. Both networks output class- and patch-level embeddings. At the image level, Sinkhorn-balanced teacher assignments supervise student prototype predictions for all student views of the same material, defining $\mathcal{L}_{\text{image}}$. At the patch level, masked student tokens are regressed to the teacher tokens at matching spatial indices, giving $\mathcal{L}_{\text{patch}}$. On the student global feature before the prototype head, KoLeo encourages dispersion ($\mathcal{L}_{\text{KoLeo}}$) and an in-batch InfoNCE pulls together the two views of the same material while pushing away other materials ($\mathcal{L}_{\text{contrast}}$). Gram anchoring aligns second-order structure on global crops. The teacher is an EMA of the student, and the Gram teacher is a frozen snapshot used only for Gram anchoring.
  • Figure 3: Patch-wise similarity and unsupervised segmentation. Left group shows cosine similarity maps between the embedding of a reference patch (red cross) and all others, visualizing the spatial coherence of learned representations. We show examples gradually growing from a mostly flat surface to a medium-scale scene. Right group displays K-means segmentations obtained from the patch embeddings. Compared to DINOv2 and DINOv3, Φeat produces similarity responses and clusters that are more spatially consistent and physically meaningful, grouping regions by reflectance and texture rather than by semantic or geometric cues.
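
Three of the ingredients named in the Figure 2 caption (the in-batch InfoNCE term, the KoLeo dispersion term, and the EMA teacher) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the temperature of 0.1, the momentum of 0.99, and the one-directional form of the contrastive term are all assumptions made for illustration, and the features are plain lists rather than ViT outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """In-batch InfoNCE (assumed form, one direction only).

    Row i of view_a and row i of view_b are features of two renderings of
    the same material; every other row of view_b acts as a negative.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def koleo(features, eps=1e-8):
    """KoLeo-style dispersion: minus the mean log nearest-neighbour distance."""
    f = [l2_normalize(v) for v in features]
    total = 0.0
    for i, f_i in enumerate(f):
        nearest = min(math.dist(f_i, f_j) for j, f_j in enumerate(f) if j != i)
        total += math.log(nearest + eps)
    return -total / len(f)


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight towards the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    # Toy batch: two materials, each rendered twice (hypothetical features).
    render_1 = [[1.0, 0.1], [0.1, 1.0]]
    render_2 = [[0.9, 0.2], [0.0, 1.0]]
    print("contrastive:", info_nce(render_1, render_2))
    print("koleo:", koleo(render_1))
    print("teacher:", ema_update([1.0, 1.0], [0.0, 2.0]))
```

In this sketch the contrastive term is small when the two renderings of each material map to nearby features and large when they are confused with other materials, while the KoLeo term penalizes batches whose features collapse onto each other. A symmetric variant would average `info_nce(a, b)` and `info_nce(b, a)`; whether the paper does so is not stated in the caption.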