Learning Continuous 3D Words for Text-to-Image Generation

Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix, Matthew Fisher, Radomir Mech, Andrew Markham, Niki Trigoni

TL;DR

Continuous 3D Words introduce a lightweight, continuous token space that encodes 3D-aware attributes (e.g., illumination, pose, camera parameters) into text prompts for diffusion models. A two-stage training regimen plus a lightweight MLP mapper g_phi(a), which maps an attribute value a to a token embedding, enables interpolation and disentanglement of object identity from attributes, while ControlNet augmentations diversify backgrounds and textures to improve generalization. Empirical results in single- and multi-attribute settings show better controllability and realism than the baselines, and real-world image editing is demonstrated via DreamBooth-based token injection. The approach enables fine-grained, 3D-aware control in text-to-image generation with minimal additional computational overhead, and scales to multi-attribute scenarios and cross-object transfer.
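
To make the role of the mapper g_phi(a) concrete, here is a minimal NumPy sketch of the idea: a small MLP turns a normalized attribute value into a single token embedding, which then replaces one row of the prompt embedding. It is an illustration under stated assumptions, not the authors' implementation. The embedding width, hidden size, activation, random weights, and the helper name `inject` are all hypothetical, and no training is shown.

```python
import numpy as np

EMBED_DIM = 768   # assumed token-embedding width; depends on the text encoder
HIDDEN_DIM = 64   # assumed hidden width of the mapper

rng = np.random.default_rng(0)

# Randomly initialised weights stand in for the learned parameters phi.
W1 = rng.normal(scale=0.1, size=(1, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, EMBED_DIM))
b2 = np.zeros(EMBED_DIM)


def g_phi(a: float) -> np.ndarray:
    """Map a normalised attribute value a in [0, 1] to one token embedding."""
    feats = np.array([a], dtype=np.float64)
    hidden = np.tanh(feats @ W1 + b1)   # smooth activation, so the output varies continuously with a
    return hidden @ W2 + b2


def inject(prompt_embeddings: np.ndarray, slot: int, a: float) -> np.ndarray:
    """Return a copy of the prompt embeddings with row `slot` replaced by g_phi(a)."""
    out = prompt_embeddings.copy()
    out[slot] = g_phi(a)
    return out


# Stand-in for eight encoded prompt tokens; slot 3 plays the role of the attribute token.
prompt = rng.normal(size=(8, EMBED_DIM))
conditioned = inject(prompt, slot=3, a=0.25)
```

Because the mapper is a continuous function of a, moving a slider from 0.25 to 0.26 changes the injected embedding only slightly, which is the property that makes interpolation between attribute values possible.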

Abstract

Current controls over diffusion models (e.g., through text or ControlNet) for image generation fall short in recognizing abstract, continuous attributes like illumination direction or non-rigid shape change. In this paper, we present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner -- we call them Continuous 3D Words. These attributes can, for example, be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine, we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes, including time-of-day illumination, bird wing orientation, dollyzoom effect, and object poses. Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process. Project Page: https://ttchengab.github.io/continuous_3d_words

Paper Structure (18 sections, 3 equations, 13 figures, 1 table)

This paper contains 18 sections, 3 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: We introduce Continuous 3D Words -- special tokens in text-to-image models that allow users to have fine-grained control over several attributes like illumination (a and c), non-rigid shape change (d), orientation (c and d), and camera parameters (b). Our approach can be trained using a single 3D mesh and a rendering engine while incurring negligible runtime and memory costs.
  • Figure 2: Method Overview. Finetuning: Our finetuning is divided into two stages. In the first stage, we render a series of images using different attribute values (e.g., illumination and pose) and feed them into the text-to-image diffusion model to learn a token embedding [Obj] representing the single mesh used for training. In the second stage, we add the tokens representing individual attributes into the prompt embedding. The two-stage training allows us to better disentangle the individual attributes from [Obj]. Inference: Attributes can be applied to different objects for text-to-image generation.
  • Figure 3: ControlNet Augmentations. Depth ControlNet is used for attributes creating direct shape changes. Lineart ControlNet is applied for more subtle changes that cannot be reflected by depths (e.g., illumination).
  • Figure 4: Qualitative Comparisons. We compare our Continuous 3D Words trained under three settings against ControlNet of various strengths. Note that the dollyzoom setup was trained with multiple chair meshes, so for ControlNet we additionally hand-pick the chair rendering that best follows the prompt (i.e., "comfortable" and in "the office").
  • Figure 5: Disentangling Multiple Attributes. We show four examples of controlling multiple Continuous 3D Words in addition to text descriptions. The first 6 rows were trained with a single golden retriever mesh, while the bottom 6 were trained with a single animated dove.
  • ...and 8 more figures