Customizing Text-to-Image Diffusion with Object Viewpoint Control

Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu

TL;DR

This work addresses the lack of explicit object viewpoint control in text-to-image diffusion model customization by introducing CustomDiffusion360, which embeds 3D viewpoint information via a FeatureNeRF-based module into a frozen diffusion backbone to condition generations on target views. It learns view-dependent features from multi-view references and fuses them with 2D diffusion features to synthesize customized objects in new contexts while preserving identity. Across the CO3Dv2 and NAVI datasets, it outperforms image-editing and prior customization baselines in aligning with both the prompt and the target viewpoint, with favorable human judgments. The method enables viewpoint-aware object customization and, when combined with existing editing and diffusion techniques, panorama synthesis and multi-object composition.

Abstract

Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task: enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt.
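
The conditioning step can be illustrated with a minimal sketch, assuming (as the Figure 2 caption states) that the rendered 3D feature is concatenated with the target noisy feature and linearly projected back to the original channel dimension. The function names, the plain-list feature layout, and the pass-through initialization below are hypothetical choices for illustration, not the authors' implementation.

```python
def fuse_features(w_x, w_y, weight, bias):
    """Concatenate two feature maps channel-wise, then project back to C channels.

    w_x, w_y: lists of P spatial locations, each a list of C floats
              (target noisy feature and rendered 3D feature).
    weight:   C rows of 2C floats (stand-in for the projection layer l).
    bias:     C floats.
    """
    fused = []
    for fx, fy in zip(w_x, w_y):
        cat = fx + fy  # channel-wise concatenation -> 2C values per location
        out = [sum(w * v for w, v in zip(row, cat)) + b
               for row, b in zip(weight, bias)]
        fused.append(out)
    return fused


def passthrough_init(channels):
    """Hypothetical initialization: copy w_x and ignore w_y.

    This is an assumption of the sketch; it makes the fused output equal
    the 2D feature before any fine-tuning of the projection.
    """
    weight = [[1.0 if j == i else 0.0 for j in range(2 * channels)]
              for i in range(channels)]
    bias = [0.0] * channels
    return weight, bias


w_x = [[1.0, 2.0], [3.0, 4.0]]   # 2 locations, C = 2
w_y = [[9.0, 9.0], [9.0, 9.0]]   # rendered feature at the same locations
weight, bias = passthrough_init(2)
fused = fuse_features(w_x, w_y, weight, bias)
```

With the pass-through weights, `fused` equals `w_x`; learning the projection is what lets the rendered feature influence the output while the channel count stays unchanged for the frozen layers that follow.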

Paper Structure (17 sections, 12 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Given multi-view images of a new object (left), denoted as V$^*$ <category name>, we create a customized text-to-image diffusion model with object viewpoint control. The customized model allows users to specify the target viewpoint for the object while synthesizing it in novel appearances and scenes, such as A green V$^*$ car, or A beetle-like V$^*$ car. We can also generate panorama images or compose multiple concepts while controlling each object's viewpoint by using MultiDiffusion (Bar-Tal et al., 2023) with our model.
  • Figure 2: Overview. We propose a model customization method that utilizes $N$ reference images defining the 3D structure of an object $\mathcal{Y}$ (we illustrate with $N=2$ views for simplicity). We modify the diffusion model U-Net with pose-conditioned transformer blocks. Our pose-conditioned transformer block features a FeatureNeRF module, which aggregates features from the individual viewpoints to the target viewpoint $\phi$, as shown in detail in Figure 3. The rendered feature $W_{y}$ is concatenated with the target noisy feature $W_{\mathbf{x}}$ and projected to the original channel dimension. We use the diffusion U-Net itself to extract features of the reference images, as shown in the top row. We only fine-tune the new parameters in the linear projection layer $l$ and FeatureNeRF in the $F_{\text{pose}}$ blocks.
  • Figure 3: FeatureNeRF. We predict volumetric features $\overline{\mathbf{V}}$ for each 3D point in the grid using reference features $\{{\mathbf{W}}_i\}$ (Eqn. \ref{eq:featurenerf1}). Given this feature, we predict the density $\sigma$ and color $rgb$ using a 2-layer MLP and use the predicted density $\sigma$ to render $\hat{\mathbf{V}}$ (which has been updated with text cross-attention $g$). The $rgb$ is only used to calculate reconstruction loss during training.
  • Figure 4: Qualitative comparison. Given a particular target pose, we show the qualitative comparison of our method with (1) Image editing methods SDEdit, InstructPix2Pix, and LEDITS++, which edit a NeRF-rendered image from the input pose, (2) ViCA-NeRF, a 3D editing method that trains a NeRF model for each input prompt, and (3) LoRA + Camera pose, our proposed baseline where we concatenate camera pose information to text embeddings during LoRA fine-tuning. Our method performs on par or better in keeping the target identity and pose while incorporating the new text prompt (e.g., putting a picnic table next to the SUV car, $1^{\text{st}}$ column) and following multiple text conditions (e.g., turning the chair red and placing it in a white room, $3^{\text{rd}}$ column). The V$^*$ token is used only in our method and the LoRA + Camera pose method. Ground truth rendering from the given pose is shown as an inset in the first three rows. We show more sample comparisons in Figure \ref{fig:result_appendix} of the Appendix.
  • Figure 5: Qualitative samples with varying object viewpoint and text prompt. Our method learns the identity of custom objects while allowing the user to control the object viewpoint and generating the object in new contexts using the text prompt, e.g., changing the background scene or object color and shape. In each row, the images were generated with the same seed while changing the object viewpoint in a turntable manner. Note that each image in a row is independently generated. Figure \ref{fig:result2_appendix} in the Appendix shows more such samples.
  • ...and 13 more figures
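
The density-based rendering mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes the standard NeRF quadrature (per-sample opacity from density, accumulated transmittance along the ray); the paper's exact rendering equation is not part of this excerpt, and the function name and plain-list inputs are hypothetical.

```python
import math


def render_ray_feature(features, sigmas, deltas):
    """Alpha-composite per-sample feature vectors along one camera ray.

    features: list of S samples, each a list of C floats (volumetric features).
    sigmas:   S non-negative densities predicted for the samples.
    deltas:   S distances between consecutive samples.
    Returns the rendered C-dim feature and the per-sample weights.
    """
    transmittance = 1.0  # fraction of the ray not yet absorbed
    rendered = [0.0] * len(features[0])
    weights = []
    for feat, sigma, delta in zip(features, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution of this sample
        weights.append(w)
        rendered = [r + w * f for r, f in zip(rendered, feat)]
        transmittance *= 1.0 - alpha
    return rendered, weights


feats = [[1.0, 2.0], [5.0, 6.0]]  # two samples on a ray, C = 2
rendered, weights = render_ray_feature(feats, [1.0, 1.0], [1.0, 1.0])
```

Samples behind dense regions receive small weights, so the rendered feature is dominated by the first surface the ray hits. The same weights could composite the predicted color for a reconstruction loss, which is consistent with the caption's note that $rgb$ is used only during training.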