NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang
TL;DR
NormalCrafter tackles the challenge of temporally coherent video surface normal estimation by leveraging priors from video diffusion models. It introduces semantic feature regularization (SFR) to align diffusion features with semantic cues and a two-stage training protocol that decouples long-range temporal reasoning in the latent space from precise spatial refinement in the pixel space. The approach achieves state-of-the-art results on open-world video normal estimation in zero-shot settings, validated across multiple datasets and reinforced by ablations that demonstrate the value of SFR and the two-stage strategy. By delivering stable, detailed normals for unconstrained videos, NormalCrafter has potential applications in 3D reconstruction, relighting, and video editing.
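The two-stage idea can be pictured as switching the training objective between stages. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `latent_loss`, `angular_loss`, `stage_loss`, the batch keys, and the `decode` callable are all hypothetical, and the specific loss forms (mean squared error on latents, mean angular error on decoded normals) are illustrative choices that the summary above does not specify.

```python
import numpy as np

def latent_loss(pred_latent, target_latent):
    """Stage 1 (assumed form): mean squared error computed in latent space."""
    return float(np.mean((pred_latent - target_latent) ** 2))

def angular_loss(pred_normals, target_normals, eps=1e-8):
    """Stage 2 (assumed form): mean angle in radians between unit normals.

    Both inputs have shape (..., 3).
    """
    p = pred_normals / np.maximum(
        np.linalg.norm(pred_normals, axis=-1, keepdims=True), eps)
    t = target_normals / np.maximum(
        np.linalg.norm(target_normals, axis=-1, keepdims=True), eps)
    cos = np.clip(np.sum(p * t, axis=-1), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def stage_loss(stage, batch, decode=None):
    """Pick the objective for the current stage.

    `decode` is a hypothetical stand-in for a latent-to-pixel decoder;
    it is only needed in stage 2.
    """
    if stage == 1:
        # Latent-space supervision: cheap enough to train on long clips.
        return latent_loss(batch["pred_latent"], batch["target_latent"])
    if stage == 2:
        # Pixel-space supervision: decode first, then compare normals directly.
        pred_normals = decode(batch["pred_latent"])
        return angular_loss(pred_normals, batch["target_normals"])
    raise ValueError(f"unknown stage: {stage}")
```

In this sketch, stage 1 never touches the decoder, which is what would make long temporal windows affordable, while stage 2 pays the decoding cost to supervise the output normals themselves.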
Abstract
Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static images, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter, which leverages the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.
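One way to read "aligning diffusion features with semantic cues" is as a similarity penalty between intermediate diffusion features and features from a separate semantic encoder. The sketch below assumes a cosine-similarity form and a linear projection between the two feature spaces; both are assumptions made for illustration, and `sfr_loss`, `l2_normalize`, and `proj` are hypothetical names rather than anything defined in the abstract.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length, guarding against zero norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def sfr_loss(diffusion_feats, semantic_feats, proj):
    """Assumed SFR-style alignment loss: mean of (1 - cosine similarity).

    diffusion_feats: (N, C_d) features taken from the diffusion model.
    semantic_feats:  (N, C_s) features from a semantic encoder (assumed frozen).
    proj:            (C_d, C_s) hypothetical projection into the semantic space.
    """
    projected = diffusion_feats @ proj
    cos = np.sum(l2_normalize(projected) * l2_normalize(semantic_feats), axis=-1)
    return float(np.mean(1.0 - cos))
```

Under this form the loss is near 0 when the projected diffusion features point in the same direction as the semantic features and near 2 when they point in opposite directions, so minimizing it pulls the two representations together regardless of feature magnitude.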
