NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang

TL;DR

NormalCrafter tackles the challenge of temporally coherent video surface normal estimation by leveraging priors from video diffusion models. It introduces semantic feature regularization (SFR) to align diffusion features with semantic cues and a two-stage training protocol that decouples long-range temporal reasoning in the latent space from precise spatial refinement in the pixel space. The approach achieves state-of-the-art results on open-world video normal estimation in zero-shot settings, validated across multiple datasets and reinforced by ablations that demonstrate the value of SFR and the two-stage strategy. By delivering stable, detailed normals for unconstrained videos, NormalCrafter has potential applications in 3D reconstruction, relighting, and video editing.

Abstract

Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, which generates temporally consistent normal sequences with intricate details from diverse videos.

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We present NormalCrafter, a novel video normal estimation model that generates temporally consistent normal sequences with fine-grained details from open-world videos of arbitrary length. Compared with results from a state-of-the-art image normal estimator, Marigold-E2E-FT [E2E], our results exhibit both higher spatial fidelity and temporal consistency, as shown in the frame visualizations and temporal profiles (marked by the red lines and rectangles).
  • Figure 2: Naively repurposing video diffusion models, e.g. SVD [blattmann2023stable], for normal estimation (Ours w/o SFR) produces over-smoothed predictions, due to insufficient high-level semantic cues in SVD features. By leveraging Semantic Feature Regularization (SFR) to align diffusion features with DINO [caron2021emerging] features, our approach yields sharper and more fine-grained normal predictions.
  • Figure 3: Overview of NormalCrafter. We model the video normal estimation task with a video diffusion model conditioned on input RGB frames. We propose Semantic Feature Regularization (SFR) $\mathcal{L}_{\text{reg}}$ to align the diffusion features with robust semantic representations from a DINO encoder, encouraging the model to concentrate on the intrinsic semantics for accurate and detailed normal estimation. Our training protocol consists of two stages: 1) training the entire U-Net in the latent space with diffusion score matching $\mathcal{L}_{\text{DSM}}$ and SFR $\mathcal{L}_{\text{reg}}$; 2) fine-tuning only the spatial layers in pixel space with angular loss $\mathcal{L}_{\text{angular}}$ and SFR $\mathcal{L}_{\text{reg}}$.
  • Figure 4: Qualitative comparisons. The input videos are sampled from the DAVIS dataset [davis_2019] and Sora-generated videos. To highlight the temporal consistency, the y-t slices at the designated red line positions are displayed in red boxes.
  • Figure 5: Ablation results with Semantic Feature Regularization (SFR). Red boxes highlight the significant differences.
  • ...and 1 more figures
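
The captions above mention two quantities that a small example may make more concrete: the angular loss used in the pixel-space stage (Figure 3) and the y-t slices used to visualize temporal consistency (Figures 1 and 4). The sketch below is illustrative only. It assumes the angular loss is the mean angle between predicted and reference normals, which is a common formulation but is not confirmed by the text here, and it assumes a video stored as nested lists indexed `video[t][y][x]`. The function names `angular_error_deg`, `mean_angular_loss`, and `yt_slice` are hypothetical and do not come from the paper or its code.

```python
import math


def angular_error_deg(pred, target, eps=1e-8):
    """Angle in degrees between two 3-D normal vectors.

    Inputs need not be unit length; they are normalized implicitly
    by dividing the dot product by the product of the norms.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos = dot / max(norm_p * norm_t, eps)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))


def mean_angular_loss(pred_normals, target_normals):
    """Mean per-pixel angular error over two equally long lists of normals.

    Assumed form of an angular loss; the paper's exact definition may differ.
    """
    errors = [angular_error_deg(p, t) for p, t in zip(pred_normals, target_normals)]
    return sum(errors) / len(errors)


def yt_slice(video, x):
    """Extract a y-t slice from a video indexed as video[t][y][x].

    Returns rows indexed by y and columns indexed by t, i.e. the fixed
    image column x of every frame laid side by side over time.
    """
    height = len(video[0])
    return [[frame[y][x] for frame in video] for y in range(height)]


if __name__ == "__main__":
    pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    ref = [(0.0, 0.0, 2.0), (0.0, 1.0, 0.0)]
    print(mean_angular_loss(pred, ref))

    # Two 2x2 frames holding scalar values for brevity.
    video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    print(yt_slice(video, x=1))
```

In a temporally consistent prediction, each row of a y-t slice changes smoothly from one column to the next, whereas per-frame flicker appears as jagged vertical streaks. This is why such slices are a convenient visual check for the property the figures highlight.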