Generative Models: What Do They Know? Do They Know Things? Let's Find Out!

Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad

TL;DR

Generative models implicitly encode intrinsic scene properties such as depth, normals, albedo, and shading. The authors introduce a parameter-efficient LoRA-based framework that recovers these intrinsics across GANs, autoregressive models, and diffusion models, using the same image-generation head and pseudo-ground-truth supervision. They show that tiny LoRA adapters (rank as low as 2) trained on a few hundred labeled examples suffice and outperform linear probing and full fine-tuning in low-data regimes, and that better generators yield more accurate intrinsic recoveries. This work offers a practical pathway to intrinsic image manipulation and relighting without extensive retraining, while linking generator quality to the recoverability of scene-structure representations.

Abstract

Generative models excel at mimicking real scenes, suggesting they might inherently encode important intrinsic scene properties. In this paper, we aim to explore the following key questions: (1) What intrinsic knowledge do generative models like GANs, Autoregressive models, and Diffusion models encode? (2) Can we establish a general framework to recover intrinsic representations from these models, regardless of their architecture or model type? (3) How minimal can the required learnable parameters and labeled data be to successfully recover this knowledge? (4) Is there a direct link between the quality of a generative model and the accuracy of the recovered scene intrinsics? Our findings indicate that small Low-Rank Adaptors (LoRA) can recover intrinsic images (depth, normals, albedo, and shading) across different generators (Autoregressive, GANs, and Diffusion) while using the same decoder head that generates the image. As LoRA is lightweight, we introduce very few learnable parameters (as few as 0.04% of Stable Diffusion model weights for a rank of 2), and we find that as few as 250 labeled images are enough to generate intrinsic images with these LoRA modules. Finally, we also show a positive correlation between the generative model's quality and the accuracy of the recovered intrinsics through control experiments.
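
To make the "very few learnable parameters" point concrete, here is a minimal sketch of a single linear layer with a low-rank adaptor, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the 64x64 layer size, the `alpha` scaling, and the random stand-in weights are all hypothetical. Only the rank of 2 is taken from the abstract. The fraction it computes is for this one toy layer, so it is not expected to match the 0.04% whole-model figure quoted above.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update B @ A.

    Hypothetical helper for illustration; sizes and init scales are assumptions.
    """

    def __init__(self, d_in, d_out, rank=2, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (a random stand-in here), shape d_out x d_in.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is rank x d_in (small random), B is d_out x rank (zeros),
        # so the adapted layer starts out identical to the frozen one.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """Map x (batch x d_in) to batch x d_out via x W^T + scale * (x A^T) B^T."""
        base = matmul(x, transpose(self.W))
        low = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * l for b, l in zip(b_row, l_row)]
                for b_row, l_row in zip(base, low)]

    def trainable_fraction(self):
        """Trainable adaptor parameters divided by frozen parameters, for this layer."""
        d_out, d_in = len(self.W), len(self.W[0])
        rank = len(self.A)
        return rank * (d_in + d_out) / (d_in * d_out)


layer = LoRALinear(d_in=64, d_out=64, rank=2)
x = [[1.0] * 64]                      # one dummy input row
y = layer.forward(x)                  # equals the frozen output while B is all zeros
frac = layer.trainable_fraction()     # 2 * (64 + 64) / (64 * 64)
```

For a fixed rank, the adaptor's parameter count grows linearly with the layer width while the frozen weight grows quadratically, which is why the trainable share shrinks as layers get larger.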

Paper Structure (18 sections, 3 equations, 33 figures, 6 tables)

Figures (33)

  • Figure 1: FID vs. metrics of intrinsics recovered from different generative models trained on FFHQ. Enhancements in image generation quality correlate positively with intrinsic recovery capabilities.
  • Figure 2: Overview of our framework applied to Stable Diffusion's UNet in a single-step manner. We adopt an efficient fine-tuning approach, low-rank adaptors (LoRA) corresponding to key feature maps -- attention matrices -- to reveal scene intrinsics. Distinct adaptors are optimized for each intrinsic (violet adaptors for surface normals; swappable with other intrinsics). We use a few labeled examples for this fine-tuning and directly obtain scene intrinsics using the same decoder that generates images, circumventing the need for specialized decoders or comprehensive model re-training.
  • Figure 3: Scene intrinsics from VQGAN, StyleGAN-v2, and StyleGAN-XL, all trained on the FFHQ dataset. The "image" column shows the synthetic images produced by each model. Other columns show four scene intrinsics predicted by a SOTA non-generative model and recovered by LoRA.
  • Figure 4: Our recovered scene intrinsics from StyleGAN-v2 trained on LSUN bedroom images.
  • Figure 5: StyleGAN-XL on ImageNet. Recovered surface normals and depth maps capture the basic shape and volume but lack precise detail and display artifacts. Albedo and shading recoveries fail. These results are correlated with the overall poor image generation quality.
  • ...and 28 more figures