PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting

Qiaowei Miao, JinSheng Quan, Kehan Li, Yawei Luo

TL;DR

PLA4D tackles the conflict between motion and geometry priors in text-to-4D diffusion-guided synthesis by establishing a pixel-level anchor through a text-generated video. It introduces static and dynamic alignment pipelines, including focal alignment, Gaussian-Mesh contrastive learning, motion alignment, and Time-Multiview refinement, all built on a 4D Gaussian-splatting representation. The approach yields video-like, geometrically consistent 4D objects with substantially reduced generation time (≈15 minutes per sample) and relies only on open-source components. Empirical results show improved geometry, motion coherence, and semantic fidelity compared with prior methods, validated by ablations and user studies.

Abstract

Previous text-to-4D methods have leveraged multiple Score Distillation Sampling (SDS) techniques, combining motion priors from video-based diffusion models (DMs) with geometric priors from multiview DMs to implicitly guide 4D renderings. However, differences in these priors result in conflicting gradient directions during optimization, causing trade-offs between motion fidelity and geometry accuracy, and requiring substantial optimization time to reconcile the models. In this paper, we introduce Pixel-Level Alignment for text-driven 4D Gaussian splatting (PLA4D) to resolve this motion-geometry conflict. PLA4D provides an anchor reference, i.e., text-generated video, to align the rendering process conditioned by different DMs in pixel space. For static alignment, our approach introduces a focal alignment method and Gaussian-Mesh contrastive learning to iteratively adjust focal lengths and provide explicit geometric priors at each timestep. At the dynamic level, a motion alignment technique and T-MV refinement method are employed to enforce both pose alignment and motion continuity across unknown viewpoints, ensuring intrinsic geometric consistency across views. With such pixel-level multi-DM alignment, our PLA4D framework is able to generate 4D objects with superior geometric, motion, and semantic consistency. Fully implemented with open-source tools, PLA4D offers an efficient and accessible solution for high-quality 4D digital content creation with significantly reduced generation time.
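
The "conflicting gradient directions" mentioned in the abstract can be pictured with a toy example. The sketch below is not the PLA4D method: it replaces each diffusion prior with a hypothetical quadratic loss that pulls a 2D parameter vector toward its own target, and measures agreement between the two gradients with cosine similarity. The names `motion_target`, `geometry_target`, and `anchor_target` are illustrative assumptions, not quantities from the paper.

```python
import math

def grad_quadratic(x, target):
    """Gradient of 0.5 * ||x - target||^2 with respect to x."""
    return [xi - ti for xi, ti in zip(x, target)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical stand-ins for two priors that disagree about the optimum.
motion_target = [1.0, 0.0]
geometry_target = [-1.0, 0.0]
# Hypothetical shared reference playing the role of a common anchor.
anchor_target = [0.0, 1.0]

x = [0.0, 0.0]  # current parameters

# Separate targets: the two gradients point in opposite directions.
conflict = cosine(grad_quadratic(x, motion_target),
                  grad_quadratic(x, geometry_target))

# Shared target: both objectives pull the parameters the same way.
aligned = cosine(grad_quadratic(x, anchor_target),
                 grad_quadratic(x, anchor_target))

print(f"cosine without shared target: {conflict:+.1f}")
print(f"cosine with shared target:    {aligned:+.1f}")
```

A negative cosine means a step that improves one objective worsens the other, which is the trade-off the abstract describes. Real SDS gradients are high-dimensional and noisy, so this only conveys the geometric intuition.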

Paper Structure

This paper contains 9 sections, 14 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: 4D objects generated by PLA4D. PLA4D produces 4D content with geometric consistency and smooth, video-like motion that aligns precisely with the text prompt, within a rapid 15-minute processing time.
  • Figure 2: Without offering an anchor reference in pixel space, multiple SDS align each rendering to their respective priors, which may not be consistent across different diffusion model priors, requiring significant time for reconciliation to generate a 4D result. With the anchor reference in pixel space, however, each SDS can optimize the 4D geometry and motion representation according to its respective prior more effectively.
  • Figure 3: Pipeline of PLA4D, which leverages text as the condition and text-generated video as an anchor for 4D generation. (a) Static alignment: We propose focal alignment to search for the best focal length for 4D automatically. We also introduce Gaussian-Mesh Contrastive Learning to provide geometric information for 4D Gaussian in unknown views, explicitly leveraging the geometric priors of the mesh. (b) Dynamic alignment: Across multiple frames, we introduce motion alignment to guide the 4D object's motion following the anchor video. Furthermore, we propose Time-Multiview (T-MV) refinement to optimize the motion and quality of the 4D object's unknown viewpoints, using the prior and the condition of the model that generates the video.
  • Figure 4: Focal Alignment
  • Figure 5: Gaussian-Mesh Contrastive Learning
  • ...and 3 more figures
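
The caption of Figure 3(a) describes focal alignment as an automatic search for the best focal length. This section does not give the actual procedure, so the following is only a minimal sketch of the general idea under simplifying assumptions: a pinhole camera, known 3D points, and a grid of candidate focal lengths scored by mean squared pixel error against anchor pixels. The helpers `project`, `pixel_error`, and `search_focal`, as well as all numeric values, are hypothetical.

```python
def project(points, focal):
    """Pinhole projection of (x, y, z) points to (u, v) pixel offsets."""
    return [(focal * x / z, focal * y / z) for x, y, z in points]

def pixel_error(pixels_a, pixels_b):
    """Mean squared distance between two lists of 2D pixel positions."""
    total = sum((ua - ub) ** 2 + (va - vb) ** 2
                for (ua, va), (ub, vb) in zip(pixels_a, pixels_b))
    return total / len(pixels_a)

def search_focal(points, anchor_pixels, candidates):
    """Return the candidate focal length whose projection best matches the anchor."""
    return min(candidates,
               key=lambda f: pixel_error(project(points, f), anchor_pixels))

# Hypothetical 3D points in camera coordinates (z > 0).
points = [(0.5, 0.2, 2.0), (-0.3, 0.4, 2.5), (0.1, -0.6, 3.0)]

# Simulated anchor observations, produced with a focal length the search does not know.
anchor_pixels = project(points, 800.0)

# Candidate focal lengths from 400 to 1200 in steps of 50.
candidates = [400.0 + 50.0 * i for i in range(17)]

best = search_focal(points, anchor_pixels, candidates)
print("selected focal length:", best)
```

In a real pipeline the score would compare rendered images with anchor video frames rather than a handful of projected points, and the search could be iterative rather than a fixed grid.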