FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli

TL;DR

FutureDepth addresses the challenge of accurate and temporally stable video depth estimation by introducing a future-oriented learning paradigm. It combines a Future Prediction Network ($F$-Net) that auto-regressively predicts multi-step future feature volumes and a Reconstruction Network ($R$-Net) that performs adaptively masked auto-encoding on multi-frame features, producing motion and scene queries. These queries guide a depth decoder via cross-attention and a refinement module to achieve state-of-the-art results on NYUDv2, KITTI, DDAD, and Sintel while maintaining efficiency close to monocular methods. The approach demonstrates that forecasting future frame features and exploiting cross-frame correspondences can substantially improve dense depth estimation and generalize well across indoor, driving, and open-domain scenes.

Abstract

In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries that work with the depth decoder, as well as with a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency similar to that of monocular models.
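
The iterative one-step-ahead prediction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(T, C, H, W)` feature layout, the mean-squared-error loss, and the linear-extrapolation `toy_predictor` (standing in for a learned F-Net) are all assumptions made for illustration.

```python
import numpy as np


def rollout_future_features(feats, predict_next, steps):
    """Auto-regressively predict `steps` future feature windows.

    feats: array of shape (T, C, H, W), features of T consecutive frames
        (assumed layout).
    predict_next: hypothetical callable mapping a (T, C, H, W) window to a
        same-shaped window shifted one time step ahead.
    """
    window = feats
    predictions = []
    for _ in range(steps):
        # Each prediction is fed back in as the input for the next step.
        window = predict_next(window)
        predictions.append(window)
    return np.stack(predictions)  # shape: (steps, T, C, H, W)


def future_prediction_loss(predictions, targets):
    """Mean squared error over all predicted steps (an assumed loss)."""
    return float(np.mean((predictions - targets) ** 2))


def toy_predictor(window):
    """Stand-in for a learned network: drop the oldest frame and append a
    linear extrapolation of the two most recent frames."""
    next_frame = 2.0 * window[-1] - window[-2]
    return np.concatenate([window[1:], next_frame[None]], axis=0)


if __name__ == "__main__":
    # Toy features: frame t is filled with the constant value t.
    feats = np.stack([np.full((2, 3, 3), float(t)) for t in range(4)])
    preds = rollout_future_features(feats, toy_predictor, steps=2)
    targets = np.stack([feats + 1.0, feats + 2.0])
    print(preds.shape, future_prediction_loss(preds, targets))
```

In a trained model, `predict_next` would be a learned module and the loss would be backpropagated through the rollout, so that errors at later steps also shape the earlier predictions.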

Paper Structure

This paper contains 23 sections, 6 equations, 19 figures, 11 tables, 3 algorithms.

Figures (19)

  • Figure 1: FutureDepth vs. existing SOTA in terms of depth accuracy (RMSE), temporal consistency (OPW [wang2023neural]), and runtime (on an NVIDIA RTX-3080 GPU), on KITTI. We compare with monocular methods: NeWCRFs [yuan2022newcrfs], iDisc [piccinelli2023idisc], and GEDepth [yang2023gedepth]; cost-volume-based methods: ManyDepth [watson2021temporal] and TC-Depth [ruhkamp2021attention] (both fully supervised here); and video-based methods: MAMo [yasarla2023mamo] and NVDS [wang2023neural]. FutureDepth outperforms existing methods in terms of both accuracy and temporal consistency, and runs efficiently.
  • Figure 2: Our proposed FutureDepth method. Features of consecutive frames are extracted by the encoder and fed to the Future Prediction Network (F-Net) and Reconstruction Network (R-Net), which are trained using iterative future prediction and adaptive masked auto-encoding, respectively. At inference, features generated by F-Net and R-Net, $Q_{motion,1,T}$ and $Q_{scene,1,T}$, which contain key motion and correspondence cues, are integrated into the depth decoding process. Furthermore, these features are also utilized in a refinement process to improve the final depth map quality.
  • Figure 3: Future Prediction Network (F-Net).
  • Figure 4: Example motion query $Q_{motion,1,T}$ generated by the future prediction network. Here we show three example channels in $Q_{motion,1,T}$, with $T=4$ and $L=6$.
  • Figure 5: Generated masks over a sample input set of frames. Masks generated by the adaptive sampler focus on important objects such as vans, cars, trams, buses, railway tracks, and road boundaries across the frames.
  • ...and 14 more figures