SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

Asrar Alruwayqi

TL;DR

SHaDe tackles dynamic 3D reconstruction from sparse multi-view imagery by combining three components: an explicit tri-plane deformation field, a canonical radiance field with time-conditioned spherical harmonics (SH) attention, and a transformer-guided latent diffusion prior. The tri-plane deformation provides a learnable, MLP-free motion prior; the SH-attention decoder yields compact, view-dependent color with dynamic appearance; and the latent diffusion module improves temporal coherence and robustness under ambiguous motion. SHaDe is trained in two stages: it first pre-trains the diffusion module, then jointly optimizes all components with reconstruction, denoising, and temporal losses. It achieves state-of-the-art results on synthetic benchmarks such as D-NeRF, surpassing HexPlane and 4D Gaussian Splatting in quality and consistency. The framework is efficient, memory-friendly, and robust to sparse views, offering a practical path toward scalable 4D reconstruction and possible extensions such as editable latent representations and dynamic scene stylization.
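
As a rough sketch of the joint objective described above, the second training stage can be read as a weighted sum of the three named loss terms. The weights $\lambda_{\text{diff}}$ and $\lambda_{\text{temp}}$ and the exact form of each term are assumptions made here for illustration; this summary does not state them.

$$
\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{recon}} + \lambda_{\text{diff}}\,\mathcal{L}_{\text{denoise}} + \lambda_{\text{temp}}\,\mathcal{L}_{\text{temporal}}
$$

Under this reading, the first stage would minimize only $\mathcal{L}_{\text{denoise}}$ for the diffusion module before the joint stage begins.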

Abstract

We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling an efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.
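
To make the MLP-free warp concrete, the following is a minimal NumPy sketch of how a tri-plane deformation field could map query points into canonical space. It is not the authors' implementation. The plane layout `(T, H, W, C)`, linear interpolation between time slices, summing the three plane features, and all function names (`bilinear_sample`, `planes_at_time`, `sample_triplane`, `warp_to_canonical`) are assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample an (H, W, C) plane at (N, 2) coordinates in [-1, 1]."""
    h, w, _ = plane.shape
    # Map normalized coordinates to continuous pixel positions.
    x = (uv[:, 0] + 1.0) * 0.5 * (w - 1)
    y = (uv[:, 1] + 1.0) * 0.5 * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = np.clip(x - x0, 0.0, 1.0)[:, None]
    fy = np.clip(y - y0, 0.0, 1.0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bottom = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def planes_at_time(stacks, t):
    """Linearly interpolate each (T, H, W, C) stack at time t in [0, 1] (assumed scheme)."""
    out = {}
    for name, stack in stacks.items():
        pos = t * (stack.shape[0] - 1)
        i0 = min(int(np.floor(pos)), stack.shape[0] - 2)
        w = pos - i0
        out[name] = (1 - w) * stack[i0] + w * stack[i0 + 1]
    return out

def sample_triplane(planes, pts):
    """Project (N, 3) points onto the xy, yz, and xz planes and sum the features."""
    f_xy = bilinear_sample(planes["xy"], pts[:, [0, 1]])
    f_yz = bilinear_sample(planes["yz"], pts[:, [1, 2]])
    f_xz = bilinear_sample(planes["xz"], pts[:, [0, 2]])
    return f_xy + f_yz + f_xz

def warp_to_canonical(deform_stacks, pts, t):
    """Read a 3-channel offset from the deformation planes and add it to the points."""
    offsets = sample_triplane(planes_at_time(deform_stacks, t), pts)  # (N, 3)
    return pts + offsets
```

In this sketch the offset is a pure table lookup: no network is evaluated per point, which is one plausible reading of "eliminating the need for MLP-based motion modeling."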

Paper Structure

This paper contains 31 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visualization of our 4D scene reconstruction across time. From left to right, we show reconstructed frames of a dynamic subject at increasing motion timestamps. The rightmost frame corresponds to the canonical configuration into which all motion is explicitly warped for consistent appearance modeling. Our method preserves structural integrity and appearance fidelity over time, even under significant non-rigid deformation.
  • Figure 2: Overview of our dynamic scene reconstruction framework. (A) The inputs are sparse multi-view images captured at different timesteps. (B) A tri-plane feature volume encodes spatial and temporal information across three orthogonal planes ($F_{xy}, F_{yz}, F_{xz}$). (C) These features are tokenized and passed through a transformer encoder, producing a latent vector $\mathbf{z}$ that is refined by a latent diffusion model (Refinement Path). The decoder reconstructs enhanced tri-plane features $\hat{\mathcal{F}}$ and deformation offsets $\hat{\Delta}$. (D) In parallel (Rendering Path), the original tri-plane features are used to compute a deformation offset $\Delta \mathbf{x}$, which warps query points into canonical space. SH coefficients are retrieved, and attention weights $\alpha_{lm}(\mathbf{d}, t)$ are applied over SH basis functions $Y_{lm}(\mathbf{d})$. (E) The view- and time-aware SH composition yields color, while volume density $\sigma$ is predicted from a separate tri-plane. These outputs are used for differentiable volume rendering to produce the final photorealistic output $\hat{I}$.
  • Figure 3: Reconstruction quality under varying input sparsity. We compare PSNR values for our method, HexPlane, and 4D Gaussian Splatting across increasing numbers of input views (3, 5, 10, 20). Our method retains high fidelity even under extreme view sparsity, demonstrating strong generalization and robustness to limited observations.
  • Figure 4: Qualitative comparison on the Jumping Jacks, Stand Up, and Lego scenes from the D-NeRF benchmark. Our method produces sharper, more temporally stable reconstructions than HexPlane and 4D Gaussian Splatting.
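
Steps (D) and (E) of Figure 2 can be illustrated with a small NumPy sketch that weights SH basis functions $Y_{lm}(\mathbf{d})$ by attention weights $\alpha_{lm}(\mathbf{d}, t)$ and composites the result along a ray. Everything beyond that outline is an assumption made here: the basis is truncated at degree 1, the attention weights come from a softmax over a hypothetical linear map `w_attn` of $(\mathbf{d}, t)$, a sigmoid maps the sum to RGB, and the compositing is the standard NeRF-style quadrature rather than anything specific to SHaDe.

```python
import numpy as np

def sh_basis_deg1(dirs):
    """Real SH basis up to degree 1 (no Condon-Shortley phase): (N, 3) -> (N, 4)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    c0, c1 = 0.28209479177387814, 0.4886025119029199
    return np.stack([np.full_like(x, c0), c1 * y, c1 * z, c1 * x], axis=1)

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sh_attention_color(sh_coeffs, dirs, t, w_attn):
    """Compose RGB from SH coefficients with view- and time-dependent band weights.

    sh_coeffs: (N, 4, 3) per-point coefficients; dirs: (N, 3) unit view directions;
    t: scalar time; w_attn: (4, 4) hypothetical linear attention parameters.
    """
    basis = sh_basis_deg1(dirs)                                             # (N, 4)
    query = np.concatenate([dirs, np.full((dirs.shape[0], 1), t)], axis=1)  # (N, 4)
    alpha = softmax(query @ w_attn, axis=1)                                 # (N, 4)
    raw = np.einsum("nk,nk,nkc->nc", alpha, basis, sh_coeffs)               # (N, 3)
    return 1.0 / (1.0 + np.exp(-raw)), alpha

def composite(sigmas, colors, deltas):
    """Standard volume rendering along one ray: (S,), (S, 3), (S,) -> (3,)."""
    alpha = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance before each sample: product of (1 - alpha) over earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return ((trans * alpha)[:, None] * colors).sum(axis=0)
```

A typical use would call `sh_attention_color` on the canonical-space samples of a ray, then pass the resulting colors together with the densities $\sigma$ to `composite` to obtain one pixel of $\hat{I}$.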