A General Protocol to Probe Large Vision Models for 3D Physical Understanding

Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman

TL;DR

The paper asks how much large vision models implicitly understand about 3D physical properties from 2D images. It introduces a general, lightweight probing protocol that trains simple linear classifiers on features from off-the-shelf models to predict region-pair 3D properties, using real-image datasets with ground truth. Results show that Stable Diffusion and DINOv2 encode geometry, shadows, and depth well, and that the choice of time step and layer affects which property is captured best; materials and occlusion are harder to capture linearly. The findings provide a practical method to assess 3D understanding, guide downstream task design, and help detect generated images based on properties the models struggle with.

Abstract

Our objective in this paper is to probe large vision models to determine to what extent they 'understand' different physical properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a general and lightweight protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene, by training discriminative classifiers on the features for these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures, and large vision models including CLIP, DINOv1, DINOv2, VQGAN, Stable Diffusion. (iii) We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties, including scene geometry, support relations, shadows and depth, but less performant for occlusion and material, while outperforming DINOv1, CLIP and VQGAN for all properties. (iv) It is observed that different time steps of Stable Diffusion features, as well as different transformer layers of DINO/CLIP/VQGAN, are good at different properties, unlocking potential applications of 3D physical understanding.
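The probing idea in contribution (i) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the feature maps are synthetic rather than extracted from a real model, region features are obtained by average pooling over a mask, the pair descriptor is the absolute difference of the two pooled vectors, and a logistic-regression probe trained by gradient descent stands in for whichever linear classifier the paper uses. The helper names (`region_feature`, `pair_descriptor`, `train_linear_probe`) are hypothetical.

```python
import numpy as np

def region_feature(feature_map, mask):
    """Average-pool a (H, W, C) feature map over a boolean (H, W) region mask."""
    return feature_map[mask].mean(axis=0)

def pair_descriptor(feature_map, mask_a, mask_b):
    """Assumed pair descriptor for a symmetric property: |v_A - v_B|."""
    return np.abs(region_feature(feature_map, mask_a) - region_feature(feature_map, mask_b))

def train_linear_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Binary prediction from the sign of the linear score."""
    return (X @ w + b > 0).astype(int)

def make_sample(rng, same, h=8, w=8, c=8, noise=0.1):
    """Synthetic stand-in for frozen model features: two regions, same or different prototype."""
    mask_a = np.zeros((h, w), dtype=bool)
    mask_a[:, : w // 2] = True
    mask_b = ~mask_a
    proto_a = rng.normal(size=c)
    proto_b = proto_a if same else rng.normal(size=c)
    fmap = rng.normal(scale=noise, size=(h, w, c))
    fmap[mask_a] += proto_a
    fmap[mask_b] += proto_b
    return fmap, mask_a, mask_b

def make_dataset(rng, n):
    """Build n region pairs, alternating positive (label 1) and negative (label 0)."""
    X, y = [], []
    for i in range(n):
        same = i % 2 == 0
        fmap, mask_a, mask_b = make_sample(rng, same)
        X.append(pair_descriptor(fmap, mask_a, mask_b))
        y.append(int(same))
    return np.array(X), np.array(y, dtype=float)

rng = np.random.default_rng(0)
X_train, y_train = make_dataset(rng, 200)
X_test, y_test = make_dataset(rng, 100)
w, b = train_linear_probe(X_train, y_train)
accuracy = float((predict(X_test, w, b) == y_test).mean())
print(f"probe accuracy on held-out pairs: {accuracy:.2f}")
```

In a real probe, the synthetic `make_sample` would be replaced by features from a frozen model (a chosen layer and, for Stable Diffusion, a chosen time step), and the same procedure would be repeated per property. The backbone is never fine-tuned, so held-out accuracy reflects what the features already encode linearly.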

Paper Structure (31 sections, 5 equations, 15 figures, 11 tables)

Figures (15)

  • Figure 1: Motivation: What do large vision models know about the 3D scene? We take Stable Diffusion as an example because it is generative, so its output is an image that can be judged directly for verisimilitude. The Stable Diffusion inpainting model is here tasked with inpainting the masked region of the real images. It correctly predicts a shadow consistent with the lighting direction (top), and a supporting structure consistent with the scene geometry (bottom). This indicates that the model's generation is consistent with the geometric (light source direction) and physical (support) properties of the scene. These examples are only for illustration; we probe a general Stable Diffusion network to determine whether it has explicit features for such 3D scene properties. The appendix provides more examples of Stable Diffusion's capability to predict different physical properties of the scene.
  • Figure 2: Example images for probing scene geometry. The first row shows a sample annotation for the same plane, and the second row shows a sample annotation for perpendicular planes. Here, and in the following figures, (A, B) are a positive pair, while (A, C) are negative. The images are from the ScanNetv2 dataset [dai2017scannet] with annotations for planes from [liu2019planercnn]. In the first row, the first piece of floor (A) is on the same plane as the second piece of floor (B), but is not on the same plane as the surface of the drawers (C). In the second row, the table top (A) is perpendicular to the wall (B), but is not perpendicular to the stool top (C).
  • Figure 3: Example images for probing material, support relation and shadow. The first row is for material, the second row for support relation, and the third row for shadow. First row: the material images are from the DMS dataset [dmsdataset]. The paintings are both covered with glass (A and B) whereas the curtain (C) is made of fabric. Second row: the support relation images are from the NYUv2 dataset [silberman2012indoor]. The paper (A) is supported by the table (B), but it is not supported by the chair (C). Third row: the shadow images are from the SOBA dataset [Wang_2020_soba]. The person (A) has the shadow (B), not the shadow (C).
  • Figure 4: Example images for probing viewpoint-dependent properties (occlusion and depth). The first row is for occlusion and the second row is for depth. First row: the occlusion images are from the Separated COCO dataset [zhan2022triocc]. The sofa (A) and the sofa (B) are part of the same object, whilst the monitor (C) is not part of the sofa. Second row: the depth images are from the NYUv2 dataset [silberman2012indoor]. The chair (A) is farther away than the object on the floor (B), but it is closer than the cupboard (C).
  • Figure 5: (a) Nomenclature for the U-Net Layers. We probe 4 downsampling encoder layers $E_1$-$E_4$ and 4 upsampling decoder layers $D_1$-$D_4$ of the Stable Diffusion U-Net. (b) A prediction failure for Material. In this example the model does not predict that the two regions are made of the same material (fabric). (c) A prediction failure for Occlusion. In this example the model does not predict that the two regions belong to the same object (the sofa).
  • ...and 10 more figures