PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
Qiaowei Miao, JinSheng Quan, Kehan Li, Yawei Luo
TL;DR
PLA4D tackles the conflict between motion and geometry priors in text-to-4D diffusion-guided synthesis by establishing a pixel-level anchor through a text-generated video. It introduces static and dynamic alignment pipelines, including focal alignment, Gaussian-Mesh contrastive learning, motion alignment, and Time-Multiview refinement, all built on a 4D Gaussian-splatting representation. The approach yields video-like, geometrically consistent 4D objects with substantially reduced generation time (≈15 minutes per sample) and relies only on open-source components. Empirical results show improved geometry, motion coherence, and semantic fidelity compared with prior methods, validated by ablations and user studies.
Abstract
Previous text-to-4D methods have leveraged multiple Score Distillation Sampling (SDS) techniques, combining motion priors from video-based diffusion models (DMs) with geometric priors from multiview DMs to implicitly guide 4D renderings. However, differences between these priors produce conflicting gradient directions during optimization, forcing trade-offs between motion fidelity and geometric accuracy and requiring substantial optimization time to reconcile the models. In this paper, we introduce Pixel-Level Alignment for text-driven 4D Gaussian splatting (PLA4D) to resolve this motion-geometry conflict. PLA4D provides an anchor reference, i.e., a text-generated video, to align the rendering processes conditioned on different DMs in pixel space. For static alignment, our approach introduces a focal alignment method and Gaussian-Mesh contrastive learning to iteratively adjust focal lengths and provide explicit geometric priors at each timestep. At the dynamic level, a motion alignment technique and a Time-Multiview (T-MV) refinement method are employed to enforce both pose alignment and motion continuity across unknown viewpoints, ensuring intrinsic geometric consistency across views. With such pixel-level multi-DM alignment, our PLA4D framework is able to generate 4D objects with superior geometric, motion, and semantic consistency. Fully implemented with open-source tools, PLA4D offers an efficient and accessible solution for high-quality 4D digital content creation with significantly reduced generation time.
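The gradient conflict described in the abstract can be illustrated with a toy sketch. This is not the PLA4D implementation and does not use SDS or any diffusion model: the two "priors" are replaced by hypothetical fixed pixel targets, the loss is a plain mean squared error, and all names (`pixel_mse_grad`, `optimize`, `motion_target`, `geometry_target`, `anchor`) are assumptions made for illustration only. It shows that two disagreeing supervision signals produce opposing gradients and a compromise result, while two signals tied to one shared pixel-space anchor converge to it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def pixel_mse_grad(render, target):
    """Gradient of the mean squared error between a rendering and a target."""
    n = len(render)
    return [2.0 * (r - t) / n for r, t in zip(render, target)]


def optimize(render, targets, lr=0.5, steps=50):
    """Gradient descent on the sum of MSE losses, one loss per target."""
    x = list(render)
    for _ in range(steps):
        grads = [pixel_mse_grad(x, t) for t in targets]
        total = [sum(g[i] for g in grads) for i in range(len(x))]
        x = [xi - lr * gi for xi, gi in zip(x, total)]
    return x


# Toy 4-pixel "rendering" (hypothetical values).
render = [0.5, 0.5, 0.5, 0.5]

# Hypothetical stand-ins for what two disagreeing priors would prefer.
motion_target = [0.9, 0.9, 0.1, 0.1]
geometry_target = [0.1, 0.1, 0.9, 0.9]

# The two gradients point in opposite directions (cosine close to -1).
conflict = cosine(pixel_mse_grad(render, motion_target),
                  pixel_mse_grad(render, geometry_target))

# Optimizing against both targets settles on a compromise between them.
compromise = optimize(render, [motion_target, geometry_target])

# If both loss terms are tied to one shared anchor frame, they agree.
anchor = [0.9, 0.9, 0.1, 0.1]
aligned = optimize(render, [anchor, anchor])
```

In this toy setting `compromise` stays at the midpoint of the two targets, matching neither, while `aligned` converges to the anchor. The actual method aligns renderings guided by different DMs to a text-generated video rather than to fixed pixel targets.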
