4Dynamic: Text-to-4D Generation with Hybrid Priors

Yu-Jie Yuan, Leif Kobbelt, Jiwen Liu, Yuan Zhang, Pengfei Wan, Yu-Kun Lai, Lin Gao

TL;DR

The paper addresses the challenge of generating dynamic 4D scenes (text-to-4D NeRF) by combining a two-stage NeRF pipeline with a direct video prior. It introduces a dynamic representation that fuses a deformation network with a topology network and employs a prior-switching training strategy to balance direct priors from a reference video with diffusion priors from a text-to-video model. Through extensive experiments, it reports state-of-the-art results for text-to-4D and monocular video-to-4D generation, validated by CLIP-based metrics and user studies, and demonstrates improved geometry, texture, and temporal coherence. This approach advances practical 4D content creation from text or single-video inputs, with implications for media, AR/VR, and interactive storytelling, while noting limitations and ethical considerations.

Abstract

Due to the fascinating generative performance of text-to-image diffusion models, growing text-to-3D generation works explore distilling the 2D generative priors into 3D, using the score distillation sampling (SDS) loss, to bypass the data scarcity problem. The existing text-to-3D methods have achieved promising results in realism and 3D consistency, but text-to-4D generation still faces challenges, including lack of realism and insufficient dynamic motions. In this paper, we propose a novel method for text-to-4D generation, which ensures the dynamic amplitude and authenticity through direct supervision provided by a video prior. Specifically, we adopt a text-to-video diffusion model to generate a reference video and divide 4D generation into two stages: static generation and dynamic generation. The static 3D generation is achieved under the guidance of the input text and the first frame of the reference video, while in the dynamic generation stage, we introduce a customized SDS loss to ensure multi-view consistency, a video-based SDS loss to improve temporal consistency, and most importantly, direct priors from the reference video to ensure the quality of geometry and texture. Moreover, we design a prior-switching training strategy to avoid conflicts between different priors and fully leverage the benefits of each prior. In addition, to enrich the generated motion, we further introduce a dynamic modeling representation composed of a deformation network and a topology network, which ensures dynamic continuity while modeling topological changes. Our method not only supports text-to-4D generation but also enables 4D generation from monocular videos. The comparison experiments demonstrate the superiority of our method compared to existing methods.
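
The abstract names a prior-switching training strategy but does not give the switching rule. As a rough illustration only, the sketch below assumes the simplest possible rule: an early phase supervised by the reference video (the direct prior), followed by a phase driven by the SDS-based losses (the diffusion priors). The class `PriorSchedule`, the parameter `direct_fraction`, the function `loss_terms`, and the loss-name strings are hypothetical and are not taken from the paper; the actual method may interleave or weight the priors differently.

```python
from dataclasses import dataclass


@dataclass
class PriorSchedule:
    """Hypothetical two-phase schedule; NOT the paper's actual switching rule."""
    total_steps: int
    direct_fraction: float = 0.5  # assumed share of steps that use the direct prior

    def active_prior(self, step: int) -> str:
        """Return which prior supervises the given training step."""
        if not 0 <= step < self.total_steps:
            raise ValueError("step out of range")
        switch_step = int(self.total_steps * self.direct_fraction)
        return "direct" if step < switch_step else "diffusion"


def loss_terms(prior: str) -> tuple:
    """Map a prior to illustrative loss names (labels only, no real losses)."""
    if prior == "direct":
        # Supervision from the reference video at the input viewpoint.
        return ("reference_video_reconstruction",)
    # Supervision from diffusion models at random viewpoints.
    return ("customized_sds", "video_sds")


if __name__ == "__main__":
    schedule = PriorSchedule(total_steps=10, direct_fraction=0.3)
    for step in range(schedule.total_steps):
        prior = schedule.active_prior(step)
        print(step, prior, loss_terms(prior))
```

Keeping the two kinds of supervision in separate steps, rather than summing every loss at every step, is one way to read "avoid conflicts between different priors"; whether the paper does this by phase, by alternation, or by some other rule is not stated in the abstract.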

Paper Structure

This paper contains 14 sections, 5 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: We propose a novel text-to-4D NeRF generation method, 4Dynamic, which exploits not only a generative prior from the score distillation sampling (SDS) but also a direct prior from the pre-generated reference video. As a result, our method can achieve high-quality 4D generation from a text prompt or a monocular video.
  • Figure 2: The pipeline of our method. We mainly show the dynamic generation process here. A dynamic representation consisting of a deformation network and a topology network is introduced. The NeRF network includes a density network and a color network. The rendered image or video under a random viewpoint is supervised by 2D SDS, customized SDS (BSD) and video SDS losses. Moreover, we exploit the pre-generated reference video to provide direct supervision under the input viewpoint. To balance between different priors, we design a prior-switching training strategy to achieve generation results that have good dynamic motion.
  • Figure 3: Comparisons of text-to-4D generation with MAV3D (Singer et al., 2023) and 4d-fy (Bahmani et al., 2023). Our method has significant advantages over MAV3D. For example, the panda face and the rocket have more details, and the emitted smoke looks more realistic. Compared to 4d-fy, our results are more realistic, thanks to the introduction of the direct prior from the pre-generated reference video. Some results of 4d-fy have unreasonable parts, marked by the orange boxes, such as a third leg growing behind the panda's back and phantoms appearing in the smoke.
  • Figure 4: More comparisons of text-to-4D generation with 4d-fy (Bahmani et al., 2023). These results demonstrate that our method generates more pronounced motions, such as the motion of the horse's legs, and outperforms 4d-fy in dynamic generation.
  • Figure 5: More text-to-4D generation results.
  • ...and 5 more figures