STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation
Sunghun Yang, Minhyeok Lee, Suhwan Cho, Jungho Lee, Sangyoun Lee
TL;DR
This work tackles temporal inconsistency in video monocular depth estimation without relying on external motion cues. It introduces STATIC, which splits frames into static and dynamic regions via a surface-normal-based difference mask and then learns temporal cues independently through the Masked Static (MS) and Surface Normal Similarity (SNS) modules, followed by a refinement that fuses the two paths. Key contributions include a differentiable, geometry-driven difference mask, the MS and SNS modules tailored to static/dynamic regions, and state-of-the-art results on KITTI Eigen and NYUv2 without extra inputs, validated by extensive ablations. The approach improves edge fidelity and temporal stability while maintaining efficiency, making it practically impactful for autonomous driving and robotics where reliable video depth is essential.
Abstract
Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic area without additional information. A difference mask from surface normals identifies static and dynamic area by measuring directional variance. For static area, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic area, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic area, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
