FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Rajeev Yasarla; Manish Kumar Singh; Hong Cai; Yunxiao Shi; Jisoo Jeong; Yinhao Zhu; Shizhong Han; Risheek Garrepalli; Fatih Porikli

TL;DR

FutureDepth addresses the challenge of accurate and temporally stable video depth estimation by introducing a future-oriented learning paradigm. It combines a Future Prediction Network (F-Net), which auto-regressively predicts multi-step future feature volumes, with a Reconstruction Network (R-Net), which performs adaptively masked auto-encoding on multi-frame features. Together they produce motion and scene queries. These queries guide a depth decoder via cross-attention, followed by a refinement module, to achieve state-of-the-art results on NYUDv2, KITTI, DDAD, and Sintel while keeping latency close to that of monocular methods. The approach demonstrates that forecasting future frame features and exploiting cross-frame correspondences can significantly improve dense depth estimation and generalize across indoor, driving, and open-domain scenes.

Abstract

In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by training it to predict the future. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries for the depth decoder, as well as for a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency similar to that of monocular models.
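The iterative one-step-ahead prediction described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' F-Net: the names `rollout`, `linear_extrapolation`, and `mse` are hypothetical, the learned network is replaced by a simple linear extrapolation, each frame's feature map is reduced to a short list of floats, and only the newest feature in the window is predicted while the rest are carried over. It only shows how a window of $T$ frame features can be advanced one time step per iteration and compared with the true shifted window.

```python
from typing import Callable, List

Feature = List[float]   # toy stand-in for one frame's feature map
Window = List[Feature]  # features of T consecutive frames


def linear_extrapolation(window: Window) -> Feature:
    """Hypothetical placeholder for a learned predictor: extrapolates the
    next feature from the last two features in the window."""
    prev, last = window[-2], window[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def rollout(window: Window,
            predict_next: Callable[[Window], Feature],
            steps: int) -> List[Window]:
    """Apply the predictor iteratively, shifting the window one time step
    ahead per iteration. Returns the predicted window after each step."""
    predicted_windows = []
    current = [list(f) for f in window]
    for _ in range(steps):
        nxt = predict_next(current)
        current = current[1:] + [nxt]  # drop oldest frame, append prediction
        predicted_windows.append(current)
    return predicted_windows


def mse(a: Window, b: Window) -> float:
    """Mean squared error between two windows of equal shape."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


# Toy sequence with constant velocity, so linear extrapolation is exact.
frames = [[float(t), 2.0 * t] for t in range(7)]  # 7 frames, 2 channels
T = 4
preds = rollout(frames[:T], linear_extrapolation, steps=3)

# Training-style loss: compare each predicted window with the true window
# shifted by the same number of steps.
loss = sum(mse(p, frames[k + 1:k + 1 + T]) for k, p in enumerate(preds))
```

In a real setting the placeholder predictor would be a trained network operating on feature volumes, and later iterations would consume earlier predictions, which is what makes the rollout auto-regressive.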

Paper Structure (23 sections, 6 equations, 19 figures, 11 tables, 3 algorithms)

Introduction
Proposed Approach: FutureDepth
Future Prediction Network (F-Net)
Reconstruction Network (R-Net)
Using the F-Net and R-Net Features
Training
Experiments
Implementation and Experiment Setup
Datasets
Main Evaluation Results
Ablation Study
Related work
Conclusion
Additional Ablation Studies
What happens when the adaptive sampler is used during inference?
...and 8 more sections

Figures (19)

Figure 1: FutureDepth vs. existing SOTA in terms of depth accuracy (RMSE), temporal consistency (OPW [wang2023neural]), and runtime (on an NVIDIA RTX-3080 GPU), on KITTI. We compare with monocular methods: NeWCRFs [yuan2022newcrfs], iDisc [piccinelli2023idisc], and GEDepth [yang2023gedepth]; cost-volume-based methods: ManyDepth [watson2021temporal] and TC-Depth [ruhkamp2021attention] (both fully supervised here); and video-based methods: MAMo [yasarla2023mamo] and NVDS [wang2023neural]. FutureDepth outperforms existing methods in terms of both accuracy and temporal consistency, and runs efficiently.
Figure 2: Our proposed FutureDepth method. Features of consecutive frames are extracted by the encoder and fed to the Future Prediction Network (F-Net) and Reconstruction Network (R-Net), which are trained using iterative future prediction and adaptive masked auto-encoding, respectively. At inference, features generated by F-Net and R-Net, $Q_{motion,1,T}$ and $Q_{scene,1,T}$, which contain key motion and correspondence cues, are integrated into the depth decoding process. Furthermore, these features are also utilized in a refinement process to improve the final depth map quality.
Figure 3: Future Prediction Network (F-Net).
Figure 4: Example motion query $Q_{motion,1,T}$ generated using the future prediction network. Here we show three example channels in $Q_{motion,1,T}$, with $T=4$ and $L=6$.
Figure 5: Generated masks over a sample input set of frames. Masks generated by the adaptive sampler focus on important objects such as vans, cars, trams, buses, railway tracks, and road boundaries across the frames.
...and 14 more figures
