Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

James Burgess, Kuan-Chieh Wang, Serena Yeung-Levy

TL;DR

This paper investigates whether 2D diffusion models implicitly encode 3D scene representations by searching for a controllable 3D viewpoint token in the text embedding space. It introduces Viewpoint Neural Textual Inversion (ViewNeTI), which trains a small mapper to predict a view token $\mathbf{v}_{\mathbf{R}}$ from camera viewpoint parameters, enabling continuous, view-controlled image generation without modifying the diffusion model. The authors demonstrate a continuous view-control manifold for a single scene and, by training across many scenes, provide evidence for a general cross-scene view-control manifold. They apply the method to view-controlled text-to-image generation and to novel view synthesis from a single image, where it achieves state-of-the-art LPIPS on DTU. These results suggest that frozen 2D diffusion models contain a latent 3D scene representation, offering a data-efficient route to 3D vision tasks without explicit 3D supervision.

Abstract

Text-to-image diffusion models generate impressive and realistic images, but do they learn to represent the 3D world from only 2D supervision? We demonstrate that yes, certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion. Our approach, Viewpoint Neural Textual Inversion (ViewNeTI), is to discover 3D view tokens; these tokens control the 3D viewpoint (the rendering pose in a scene) of generated images. Specifically, we train a small neural mapper to take continuous camera viewpoint parameters and predict a view token (a word embedding). This token conditions diffusion generation via cross-attention to produce images with the desired camera viewpoint. Using ViewNeTI as an evaluation tool, we report two findings: first, the text latent space has a continuous view-control manifold for particular 3D scenes; second, we find evidence for a generalized view-control manifold for all scenes. We conclude that since the view token controls the 3D 'rendering' viewpoint, there is likely a scene representation embedded in frozen 2D diffusion models. Finally, we exploit the 3D scene representations for 3D vision tasks, namely view-controlled text-to-image generation and novel view synthesis from a single image, where our approach sets a new state of the art for LPIPS. Code is available at https://github.com/jmhb0/view_neti
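
The mapper described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the flattened 12-number camera vector, the frequency schedule, the hidden width, the 768-dimensional token size, and the names `fourier_features` and `ViewMapper` are all assumptions chosen for illustration. The sketch shows only the forward pass: camera parameters, diffusion timestep, and UNet layer index are Fourier-encoded and passed through a small MLP that outputs one word-embedding-sized vector.

```python
import numpy as np


def fourier_features(x, num_freqs=4):
    """Encode each coordinate with sin/cos at geometrically spaced frequencies."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # shape (num_freqs,)
    angles = x[..., None] * freqs                      # shape (..., d, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)            # shape (..., d * 2 * num_freqs)


class ViewMapper:
    """Hypothetical two-layer MLP: (camera, t, layer) -> view token embedding."""

    def __init__(self, camera_dim, hidden_dim=64, token_dim=768, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        feat_dim = (camera_dim + 2) * 2 * num_freqs    # +2 for timestep and layer
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(feat_dim), (feat_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, token_dim))
        self.b2 = np.zeros(token_dim)

    def __call__(self, camera, t, layer):
        # t and layer are assumed to be pre-normalized to [0, 1].
        x = np.concatenate([np.asarray(camera, dtype=np.float64), [t, layer]])
        h = np.maximum(fourier_features(x, self.num_freqs) @ self.w1 + self.b1, 0.0)
        return h @ self.w2 + self.b2                   # one word-embedding-sized vector


mapper = ViewMapper(camera_dim=12)
token_a = mapper(np.zeros(12), t=0.5, layer=0.25)
token_b = mapper(np.full(12, 0.1), t=0.5, layer=0.25)
```

In a full textual-inversion setup, only the mapper weights would be trained, using the diffusion denoising loss with the diffusion model frozen. That training loop is omitted here.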

Paper Structure (30 sections, 4 equations, 22 figures, 1 table)


Figures (22)

  • Figure 1: We find '3D view tokens' in the Stable Diffusion word embedding space. (a) Given a camera pose, we predict a token (word embedding), which we use to condition diffusion generation. (b) Different view tokens give different views of the generated 3D scene. We use 3D view tokens to study scene representations in diffusion models.
  • Figure 2: A masked-out car (left) with infilling (images 2 to 4) by a Stable Diffusion model [Rombach et al., 2022], with important details marked with orange dots. Infill image 1 has shadows that are consistent with the shadows on the car. Infill image 2 has object reflections. Infill image 3 has reflections and shadows. This is evidence that 2D diffusion models are capable of 3D reasoning, which motivates our investigation into 3D view control.
  • Figure 3: We find a continuous view-control manifold in word embedding space for one scene, by learning a token from a few training views that generalizes to test views.
  • Figure 4: Evidence for a semantically disentangled view-control manifold. Each scene (columns) maps to a scene token, while each view (rows) maps to a view token that is shared across scenes.
  • Figure 5: Training procedure for the '3D view token' in Viewpoint Neural Textual Inversion (ViewNeTI), our method for evaluating 3D representations in the word embedding space of frozen diffusion models. (a) To optimize a single scene (see the single-scene optimization section), we have (top) a small multi-view dataset, $\mathcal{D}_{MV}$, with images $\mathbf{x}_i$ and camera poses $\mathbf{R}_i$. We create a caption for each image, with a token $S_{\mathbf{R}_i}$ for each view $\mathbf{R}_i$. (Bottom) The embedding for $S_{\mathbf{R}_i}$ is $\mathbf{v}_{\mathbf{R}_i}$ and is predicted by a neural network $\mathcal{M}_v$, conditioned on the camera parameters $\mathbf{R}_i$, the diffusion timestep $t$, and the UNet layer $\ell$. All parameters are encoded by a Fourier feature mapper, $\gamma$ [Tancik et al., 2020]. The other tokens take their regular word embeddings. The prompt is passed to the CLIP text encoder [Radford et al., 2021], then the text embedding is passed to the UNet via cross-attention [Rombach et al., 2022]. We do diffusion model training on this dataset while optimizing only $\mathcal{M}_v$ (this is textual inversion training [Gal et al., 2022; Alaluf et al., 2023]). (b) To optimize multiple scenes (see the multi-scene optimization section), we have a multi-view dataset with multiple scenes but shared camera poses $\mathbf{R}_i$. The optimization is the same, except each scene $s_j$ has its own scene token $S_{s_j}$ in the caption. The view tokens $S_{\mathbf{R}_i}$ are shared over the scenes. The embedding for $S_{s_j}$ is $\mathbf{v}_{s_j}$ and is predicted by a scene mapper, $\mathcal{M}_{s_j}$, conditioned on the timestep $t$ and the UNet layer $\ell$. $\mathcal{M}_v$ and $\mathcal{M}_{s_j}$ are jointly optimized.
  • ...and 17 more figures
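
The token sharing described in the Figure 5 caption (one view token per camera pose, shared across scenes, plus one scene token per scene) can be sketched as a small caption builder. The caption template, the placeholder strings `<view_i>` and `<scene_j>`, and the function name `build_captions` are hypothetical; the source does not give the exact caption wording.

```python
def build_captions(scenes, num_views, template="{view}. A photo of a {scene}"):
    """Return {(scene_name, view_index): caption} for a multi-scene, multi-view dataset.

    View placeholders depend only on the view index, so they are shared across
    scenes; scene placeholders depend only on the scene index.
    """
    captions = {}
    for j, scene in enumerate(scenes):
        for i in range(num_views):
            captions[(scene, i)] = template.format(
                view=f"<view_{i}>",     # hypothetical placeholder for S_{R_i}
                scene=f"<scene_{j}>",   # hypothetical placeholder for S_{s_j}
            )
    return captions


captions = build_captions(["car", "chair"], num_views=3)
```

During training, each placeholder's embedding would come from its mapper (view mapper or scene mapper) rather than from the regular word-embedding table, while all other words keep their regular embeddings.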