Bringing Objects to Life: training-free 4D generation from 3D objects through view consistent noise

Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik

TL;DR

Bringing Objects to Life tackles the challenge of generating controllable 4D content from static 3D objects without training new models. It converts a 3D mesh into a static 4D NeRF and then animates it with an image-to-video diffusion process conditioned on rendered views and a text prompt. Two components guide the optimization: a viewpoint-consistent noise scheme and an attention-guided masked SDS loss, which together preserve object identity while enabling dynamic motion. The method, named 3D24D, shows better temporal coherence, prompt adherence, and visual fidelity than a multi-view baseline on two object datasets, illustrating the feasibility of training-free 3D-to-4D generation. The approach could benefit content creation for virtual worlds, gaming, and interactive media by providing flexible, prompt-driven 4D content without retraining, though it currently has memory and model-dependency limitations.

Abstract

Recent advancements in generative models have enabled the creation of dynamic 4D content (3D objects in motion) based on text prompts, which holds potential for applications in virtual worlds, media, and gaming. Existing methods provide control over the appearance of generated content, including the ability to animate 3D objects. However, the dynamics they can generate are limited to the mesh datasets they were trained on, and they cannot produce growth or structural development. In this work, we introduce a training-free method for animating 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom general scenes while maintaining the original object's identity. We first convert a 3D mesh into a static 4D Neural Radiance Field (NeRF) that preserves the object's visual attributes. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce a view-consistent noising protocol that aligns object perspectives with the noising process to promote lifelike movement, and a masked Score Distillation Sampling (SDS) loss that leverages attention maps to focus optimization on relevant regions, better preserving the original object. We evaluate our model on two different 3D object datasets for temporal coherence, prompt adherence, and visual fidelity, and find that our method outperforms a baseline based on multi-view training, achieving better consistency with the textual prompt in hard scenarios.
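
To make the masked SDS idea more concrete, here is a minimal sketch in plain Python. It assumes the standard SDS gradient form, a per-pixel weight times the difference between predicted and injected noise, and simply scales it by a mask derived from an attention map. The function names `attention_to_mask` and `masked_sds_grad`, the min-max normalization, and the single-channel 2D layout are illustrative assumptions; the abstract does not specify how the paper builds or applies its mask.

```python
def attention_to_mask(attn):
    """Min-max normalize a 2D attention map to the range [0, 1].

    Assumption: a soft mask obtained by normalization; the paper may
    derive its mask differently.
    """
    flat = [a for row in attn for a in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant attention map carries no preference: keep every pixel.
        return [[1.0 for _ in row] for row in attn]
    return [[(a - lo) / span for a in row] for row in attn]


def masked_sds_grad(eps_pred, eps, mask, weight=1.0):
    """Per-pixel SDS gradient weight * (eps_pred - eps), scaled by the mask.

    eps_pred: noise predicted by the diffusion model (2D list)
    eps:      noise that was actually added (2D list)
    mask:     values in [0, 1]; 0 suppresses the update at that pixel
    """
    return [
        [weight * m * (p - e) for p, e, m in zip(p_row, e_row, m_row)]
        for p_row, e_row, m_row in zip(eps_pred, eps, mask)
    ]


if __name__ == "__main__":
    attn = [[0.0, 2.0], [1.0, 4.0]]
    mask = attention_to_mask(attn)
    grad = masked_sds_grad([[1.0, 1.0], [1.0, 1.0]],
                           [[0.0, 0.0], [0.0, 0.0]], mask)
    print(mask)
    print(grad)
```

In this sketch, pixels with low attention receive a small or zero gradient, so the optimization leaves those regions of the rendered object largely unchanged.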

Paper Structure (23 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Our method, 3D24D, takes a static 3D object and a textual prompt describing a desired action. It then adds dynamics to the object based on the prompt to create a 4D animation, essentially a video viewable from any perspective. On the right, we display four 3D frames from the generated 4D animation. Each 3D frame contains an RGB image and a corresponding depth map on its bottom left.
  • Figure 2: Workflow of our 3D24D approach, designed to optimize a 4D radiance field using a neural representation that captures both static and dynamic elements. First, a 4D NeRF is trained to represent the static object (plant, left), having the same 3D structure at each time step. Then, we introduce dynamics to the 4D NeRF by distilling the prior from a pre-trained image-to-video model. At each SDS step, we select a viewpoint and render the input object, the noise sphere, and the 4D NeRF from that viewpoint. These renders, along with the textual prompts, are then fed into the image-to-video model, and the SDS loss is calculated to guide the generation of motion while preserving the object's identity. The noise is rendered from the sphere using the same viewpoint as the static object, providing better consistency at each step.
  • Figure 3: 3D24D brings various objects to life. On the left, we display the input object along with a textual prompt describing the desired action. On the right, we present four frames from the generated object, viewed from the front. Each 3D frame is split into an RGB image and its corresponding depth map, shown in the top right corner.
  • Figure 4: Qualitative comparison. A render of the input object is shown on the left, alongside renders from 3D24D (middle) and Animate3D (right). In this example, our method generates a 4D object that is better aligned with the prompt "an elephant grows its ears as long as wings to fly".
  • Figure 5: Qualitative ablation results demonstrate the contribution of each part of our method. Without our view-consistent noising, the broccoli does not "bloom". Without our attention-masked SDS, the plant is less rich in details.
  • ...and 4 more figures
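
The Figure 2 caption describes rendering noise from a sphere using the same viewpoint as the object. A toy sketch of that idea follows: a fixed Gaussian noise texture lives on a latitude/longitude grid, and a "render" samples it for a given camera azimuth, so the same viewpoint always yields the same noise. Everything here is an assumption made for illustration (the grid resolution, the azimuth-only camera, the `fov_deg` parameter, and the names `make_noise_sphere` and `render_noise`); the paper's actual noise-rendering procedure is not given in this excerpt.

```python
import random


def make_noise_sphere(n_lat=32, n_lon=64, seed=0):
    """Fixed Gaussian noise texture on a latitude/longitude grid."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_lon)] for _ in range(n_lat)]


def render_noise(sphere, azimuth_deg, height=8, width=8, fov_deg=60.0):
    """Sample the sphere texture for a camera at the given azimuth.

    Toy camera model (assumption): each pixel column maps to a longitude
    centered on the camera azimuth, each pixel row maps to a latitude band.
    """
    n_lat, n_lon = len(sphere), len(sphere[0])
    image = []
    for r in range(height):
        # Row center mapped to a latitude index.
        i = min(int((r + 0.5) / height * n_lat), n_lat - 1)
        row = []
        for c in range(width):
            # Column center mapped to a longitude, wrapped into [0, 360).
            lon_deg = azimuth_deg + ((c + 0.5) / width - 0.5) * fov_deg
            j = int((lon_deg % 360.0) / 360.0 * n_lon) % n_lon
            row.append(sphere[i][j])
        image.append(row)
    return image


if __name__ == "__main__":
    sphere = make_noise_sphere()
    a = render_noise(sphere, azimuth_deg=30.0)
    b = render_noise(sphere, azimuth_deg=30.0)
    print(a == b)  # the same viewpoint reuses the same noise
```

Because the noise is tied to directions on the sphere and not drawn fresh per step, revisiting a viewpoint reproduces its noise, which is the consistency property the caption refers to.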