Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A. Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey R. Allen, Thomas Kipf

TL;DR

This work proposes a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. It achieves state-of-the-art multi-object editing results on synthetic 3D scene datasets and on two real-world video datasets.

Abstract

We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).
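
The pairing of appearance and pose described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain average pooling stands in for the paper's pooling operator, the pose embedding is a single linear map, and the two parts are joined by concatenation. The names `pool_appearance`, `embed_pose`, `build_neural_assets`, and `pose_weight` are hypothetical.

```python
import numpy as np

def pool_appearance(feature_map, box_2d):
    """Average-pool features inside a 2D box.

    Assumption: a crude stand-in for the pooling used in the paper.
    feature_map: (H, W, C) array; box_2d: (x0, y0, x1, y1) in feature-map cells.
    """
    x0, y0, x1, y1 = box_2d
    crop = feature_map[y0:y1, x0:x1]                       # (h, w, C)
    return crop.reshape(-1, crop.shape[-1]).mean(axis=0)   # (C,)

def embed_pose(box_3d_corners, pose_weight):
    """Project the 8 flattened 3D box corners to a pose feature.

    Assumption: a single linear layer; pose_weight has shape (24, D_pose).
    """
    return box_3d_corners.reshape(-1) @ pose_weight        # (D_pose,)

def build_neural_assets(src_features, src_boxes_2d, tgt_boxes_3d, pose_weight):
    """One token per object: appearance from the source frame,
    pose from the target frame (joined by concatenation, an assumption)."""
    tokens = []
    for box_2d, box_3d in zip(src_boxes_2d, tgt_boxes_3d):
        appearance = pool_appearance(src_features, box_2d)
        pose = embed_pose(box_3d, pose_weight)
        tokens.append(np.concatenate([appearance, pose]))
    return np.stack(tokens)                                # (N, C + D_pose)

# Toy inputs: random features, two objects.
rng = np.random.default_rng(0)
src_features = rng.normal(size=(16, 16, 32))
src_boxes_2d = [(0, 0, 8, 8), (4, 6, 12, 14)]
tgt_boxes_3d = rng.normal(size=(2, 8, 3))
pose_weight = rng.normal(size=(24, 16))

assets = build_neural_assets(src_features, src_boxes_2d, tgt_boxes_3d, pose_weight)
print(assets.shape)  # (2, 48)
```

Because each token keeps appearance and pose in separate slices in this sketch, swapping only the pose slice corresponds to moving an object, while swapping only the appearance slice corresponds to replacing it with another object.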

Paper Structure (27 sections, 3 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: 3D-aware editing with our Neural Asset representations. Given a source image and object 3D bounding boxes, we can translate, rotate, and rescale the object. In addition, we support compositional generation by transferring objects or backgrounds across images.
  • Figure 2: Neural Assets framework. (a) We train our model on pairs of video frames, which contain objects under different poses. We encode appearance tokens from a source image with $\mathrm{RoIAlign}$, and pose tokens from the objects' 3D bounding boxes in a target image. They are combined to form our Neural Asset representations. (b) An image diffusion model is conditioned on Neural Assets and a separate background token to reconstruct the target image as the training signal. (c) During inference, we can manipulate the Neural Assets to control the objects in the generated image: rotate the object's pose (blue) or replace an object by a different one from another image (pink).
  • Figure 3: Single-object editing results on the unseen-object subset of OBJect. We evaluate on the Translation, Rotation, and Removal tasks. We follow 3DIT and compute metrics inside the edited object's bounding box. Our results are averaged over 3 random seeds.
  • Figure 4: Multi-object editing results on MOVi-E, Objectron, and Waymo Open (denoted as Waymo in the figures). We compute metrics inside the edited objects' bounding boxes.
  • Figure 5: Qualitative comparison on MOVi-E, Objectron, and Waymo Open. All models generate a new image given a source image and the 3D bounding box of target objects. Our method performs the best in object identity preservation, editing accuracy, and background modeling.
  • ...and 9 more figures
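
The box edits mentioned in the Figure 1 caption (translate, rotate, rescale) can be illustrated with a small geometric sketch. This is only an assumption-laden example: it takes the z axis as the vertical rotation axis, uses an arbitrary corner ordering, and ignores how the paper encodes or projects the boxes. The helpers `box_corners` and `edit_box` are hypothetical names.

```python
import numpy as np

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box as an (8, 3) array."""
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    return np.asarray(center, dtype=float) + 0.5 * signs * np.asarray(size, dtype=float)

def edit_box(corners, translation=(0.0, 0.0, 0.0), yaw=0.0, scale=1.0):
    """Rescale and rotate a box about its own center, then translate it.

    Assumption: rotation is a yaw around the z axis.
    """
    center = corners.mean(axis=0)
    c, s = np.cos(yaw), np.sin(yaw)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    local = (corners - center) * scale          # rescale around the center
    rotated = local @ rot_z.T                   # rotate around the center
    return rotated + center + np.asarray(translation, dtype=float)

corners = box_corners(center=(0.0, 0.0, 0.0), size=(2.0, 2.0, 2.0))
edited = edit_box(corners, translation=(1.0, 0.0, 0.0), yaw=np.pi / 2, scale=2.0)

print(edited.mean(axis=0))  # new box center, approximately [1, 0, 0]
```

In a pipeline like the one the captions describe, an edited box of this kind would presumably be re-encoded as the object's pose input while its appearance input is left untouched.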