SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion
Asrar Alruwayqi
TL;DR
SHaDe tackles dynamic 3D reconstruction from sparse multi-view imagery by combining three components: an explicit tri-plane deformation field, a canonical radiance field with time-conditioned spherical harmonics (SH) attention, and a transformer-guided latent diffusion prior. The tri-plane deformation field provides a learnable, MLP-free motion prior; the SH-attention decoder yields compact, view-dependent color with dynamic appearance; and the latent diffusion module improves temporal coherence and robustness under ambiguous motion. Training proceeds in two stages: SHaDe first pretrains the diffusion module, then jointly optimizes all components with reconstruction, denoising, and temporal consistency losses. It reports state-of-the-art results on synthetic benchmarks such as D-NeRF, surpassing HexPlane and 4D Gaussian Splatting in quality and consistency. The framework is efficient, memory-friendly, and robust to sparse views, offering a practical path toward scalable 4D reconstruction, with potential extensions such as editable latent representations and dynamic scene stylization.
Abstract
We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling an efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pretrained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.
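To make the representation described above more concrete, the following NumPy sketch shows one plausible reading of it: an offset read directly from a tri-plane (no MLP) warps sample points into canonical space, canonical features are looked up from three orthogonal planes, and color is evaluated from SH coefficients with a softmax weighting over frequency bands. This is an illustration under stated assumptions, not the paper's implementation. The function names, the array layouts, the per-timestep storage of deformation planes, the summation of the three plane features, the degree-1 SH truncation, and the use of externally supplied `band_logits` as the attention scores are all hypothetical choices made for this example.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (H, W, C) plane at coordinates u, v in [-1, 1]."""
    H, W, _ = plane.shape
    x = (np.clip(u, -1.0, 1.0) + 1.0) * 0.5 * (W - 1)
    y = (np.clip(v, -1.0, 1.0) + 1.0) * 0.5 * (H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bot = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def triplane_features(planes, pts):
    """planes: (3, H, W, C) for the XY, XZ, YZ planes; pts: (N, 3) in [-1, 1].
    Assumption: the three plane features are combined by summation."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    return (bilinear_sample(planes[0], x, y)
            + bilinear_sample(planes[1], x, z)
            + bilinear_sample(planes[2], y, z))

def warp_to_canonical(deform_planes, pts, t_index):
    """deform_planes: (T, 3, H, W, 3). Assumption: one tri-plane per time step
    whose 3 channels are used directly as the xyz offset (no MLP)."""
    return pts + triplane_features(deform_planes[t_index], pts)

def sh_basis_deg1(dirs):
    """Real SH basis up to degree 1 for unit directions (N, 3) -> (N, 4).
    Sign conventions for the degree-1 terms vary between codebases."""
    c0, c1 = 0.28209479, 0.48860251
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([np.full_like(x, c0), c1 * y, c1 * z, c1 * x], axis=1)

def sh_color(coeffs, dirs, band_logits):
    """coeffs: (N, 4, 3) RGB SH coefficients; band_logits: (N, 2) scores for
    bands l=0 and l=1. A softmax over bands reweights each band's contribution."""
    w = np.exp(band_logits - band_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    per_coeff = np.repeat(w, [1, 3], axis=1)  # band 0 has 1 coeff, band 1 has 3
    basis = sh_basis_deg1(dirs) * per_coeff
    return np.einsum("nk,nkc->nc", basis, coeffs)

# Usage with random placeholder data (not trained values).
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(5, 3))
deform = np.zeros((4, 3, 8, 8, 3))            # zero offsets: identity warp
canon_planes = rng.normal(size=(3, 8, 8, 16))
canon_pts = warp_to_canonical(deform, pts, t_index=2)
feats = triplane_features(canon_planes, canon_pts)
```

In this sketch the only learnable quantities would be the plane grids and SH coefficients themselves, which is one way to read the "MLP-free" claim. How the actual method produces the band attention scores and conditions them on time is not specified in the abstract.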
