Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors

Zhengfei Kuang, Tianyuan Zhang, Kai Zhang, Hao Tan, Sai Bi, Yiwei Hu, Zexiang Xu, Milos Hasan, Gordon Wetzstein, Fujun Luan

Abstract

We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video–depth and video–normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy combines state-of-the-art image estimation models with optical-flow-based smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models like Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Buffer Anytime improves temporal consistency in video geometry estimation without paired training data. Top: Comparison of depth estimation between Depth Anything V2 and our method on a challenging dynamic scene with lighting variations. While the original model shows inconsistent depth predictions across frames, our approach maintains stable depth estimates. Bottom: Surface normal estimation comparison between Marigold-E2E-FT and our method on an outdoor scene with complex geometry. Our method preserves consistent normal maps across frames while maintaining accurate geometric details. In both cases, our method achieves better temporal consistency without requiring video–geometry paired training data.
  • Figure 2: Visualization of Our Training Pipeline. Our pipeline consists of three branches: an optical flow network that extracts optical flow from the input video to guide temporal smoothness; a fixed single-frame image model for regularization; and the trained video model, which integrates a fine-tuned image backbone with temporal layers.
  • Figure 3: Illustration of our masking procedure for the optical flow loss. Row 1: Given two adjacent frames, we first apply cycle validation to the predicted optical flows to filter out outliers. Row 2: We then apply an edge detection procedure to the predicted depth map to mask out the boundaries. Row 3: The combination of the two masks diminishes the effect of inaccurate optical flow predictions on the smoothness error map.
  • Figure 4: Our Network Architecture. We present two model architectures for video geometry estimation: (a) A depth estimation model based on Depth Anything V2, where we inject temporal blocks between fusion layers while keeping the ViT backbone frozen. The model processes video frames $(B,T,3,H,W)$ through a patchify layer, multiple ViT blocks with reassemble and fusion operations, and temporal blocks to produce depth maps $(B,T,H,W)$. (b) A normal estimation model built upon Marigold-E2E-FT, where we insert temporal blocks between spatial layers in the diffusion U-Net. The model takes RGB video frames as input, processes them through an encoder to obtain latent maps, combines them with zero noise maps, and processes them through the U-Net with alternating spatial and temporal blocks to generate normal maps $(B,T,3,H,W)$. Blue blocks are fixed during training, green blocks are fine-tuned, and pink blocks are trained from scratch with zero initialization.
  • Figure 5: Qualitative comparison on Video Depth Estimation. For better visualization, we also show, to the right of each video, the time slice taken along its red line. Our model preserves the structural details of the image model results while producing smoother results along the time axis.
  • ...and 1 more figure
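
The masking procedure described in the Figure 3 caption (cycle validation of the optical flow, an edge mask on the predicted depth, and their combination) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the thresholds `tol` and `thresh`, the nearest-neighbour sampling, the finite-difference edge detector, and the flow layout (channel 0 is the x displacement, channel 1 is the y displacement) are all assumptions made for the example.

```python
import numpy as np

def warp_nearest(field, flow):
    """Sample `field` (H, W, C) at pixel positions displaced by `flow` (H, W, 2).

    Assumed layout: flow[..., 0] is the x shift, flow[..., 1] is the y shift.
    Uses nearest-neighbour lookup; also returns which targets stay in frame.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xt = np.clip(xt, 0, w - 1)
    yt = np.clip(yt, 0, h - 1)
    return field[yt, xt], inside

def cycle_mask(flow_fwd, flow_bwd, tol=1.0):
    """True where following the forward flow and then the backward flow
    returns to within `tol` pixels of the starting point."""
    bwd_at_target, inside = warp_nearest(flow_bwd, flow_fwd)
    err = np.linalg.norm(flow_fwd + bwd_at_target, axis=-1)
    return inside & (err < tol)

def edge_mask(depth, thresh=0.1):
    """True away from depth boundaries (small finite-difference gradient)."""
    gy, gx = np.gradient(depth)
    return np.hypot(gx, gy) < thresh

def smoothness_mask(flow_fwd, flow_bwd, depth, tol=1.0, thresh=0.1):
    """Pixels kept for the smoothness term: flow is cycle-consistent
    and the pixel does not sit on a depth edge."""
    return cycle_mask(flow_fwd, flow_bwd, tol) & edge_mask(depth, thresh)

# Toy usage: a uniform one-pixel shift to the right and a depth step.
flow_fwd = np.zeros((4, 6, 2))
flow_fwd[..., 0] = 1.0
flow_bwd = -flow_fwd
depth = np.ones((4, 6))
depth[:, 3:] = 2.0
mask = smoothness_mask(flow_fwd, flow_bwd, depth)
```

In this toy case the mask drops the two columns that straddle the depth step and the last column, whose flow target leaves the frame. A per-pixel smoothness error would then be multiplied by such a mask before averaging, so that unreliable flow does not dominate the loss.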