Table of Contents
Fetching ...

STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

Sunghun Yang, Minhyeok Lee, Suhwan Cho, Jungho Lee, Sangyoun Lee

TL;DR

This work tackles temporal inconsistency in video monocular depth estimation without relying on external motion cues. It introduces STATIC, which splits frames into static and dynamic regions via a surface-normal-based difference mask and then learns temporal cues independently through the Masked Static (MS) and Surface Normal Similarity (SNS) modules, followed by a refinement that fuses the two paths. Key contributions include a differentiable, geometry-driven difference mask, the MS and SNS modules tailored to static/dynamic regions, and state-of-the-art results on KITTI Eigen and NYUv2 without extra inputs, validated by extensive ablations. The approach improves edge fidelity and temporal stability while maintaining efficiency, making it practically impactful for autonomous driving and robotics where reliable video depth is essential.

Abstract

Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic area without additional information. A difference mask from surface normals identifies static and dynamic area by measuring directional variance. For static area, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic area, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic area, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.

STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

TL;DR

This work tackles temporal inconsistency in video monocular depth estimation without relying on external motion cues. It introduces STATIC, which splits frames into static and dynamic regions via a surface-normal-based difference mask and then learns temporal cues independently through the Masked Static (MS) and Surface Normal Similarity (SNS) modules, followed by a refinement that fuses the two paths. Key contributions include a differentiable, geometry-driven difference mask, the MS and SNS modules tailored to static/dynamic regions, and state-of-the-art results on KITTI Eigen and NYUv2 without extra inputs, validated by extensive ablations. The approach improves edge fidelity and temporal stability while maintaining efficiency, making it practically impactful for autonomous driving and robotics where reliable video depth is essential.

Abstract

Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic area without additional information. A difference mask from surface normals identifies static and dynamic area by measuring directional variance. For static area, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic area, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic area, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.

Paper Structure

This paper contains 31 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Results of the (a) Depth Anything v2 yang2024depth, (b) GEDepth yang2023gedepth and proposed (c) STATIC from sequential frames. (c) improves ground continuity over the single-frame approach (a) and achieves better temporal consistency in object shape and depth compared to other video depth estimation method (b). For visual comparison, each method is rescaled.
  • Figure 2: Overall architecture of the proposed STATIC model. The model primarily consists of an encoder, depth embedder, and video modules, with a surface normal decoder and head as submodules.
  • Figure 3: The process of generating the difference mask. The difference mask is updated at each step from (a) to (c). Step (a) involves pixel-wise variance calculation, (b) applies thresholding, and (c) refines the results using a pseudo-labeling process. The final mask $M^l$ is utilized within the model.
  • Figure 4: The structure of the SNS module. First, a similarity map $S$ is generated using two features. Then, by multiplying this map with the depth features of other frames, a process similar to warping is performed.
  • Figure 5: The structure of the MS module. First, an attention mechanism is applied using masked features that retain only the static area. Next, a refinement process is conducted to integrate the aligned feature with the depth feature.
  • ...and 7 more figures