Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

JunDa Cheng, Wei Yin, Kaixuan Wang, Xiaozhi Chen, Shijie Wang, Xin Yang

TL;DR

This work tackles depth estimation for autonomous driving under noisy camera poses by introducing AFNet, a two-branch network that fuses single-view and multi-view depth predictions with an adaptive fusion module. A warping-based confidence M_w, alongside branch confidences M_s and M_m, enables robust, pixel-wise fusion that gracefully handles textureless regions, dynamic objects, and pose errors. Empirical results on KITTI and DDAD show state-of-the-art accuracy and, crucially, superior robustness under pose perturbations, with a pose-correction variant further boosting performance under challenging noise. The approach advances practical depth perception for autonomous systems by balancing accuracy and resilience in real-world conditions.

Abstract

Multi-view depth estimation has achieved impressive performance on various benchmarks. However, almost all current multi-view systems rely on given ideal camera poses, which are unavailable in many real-world scenarios, such as autonomous driving. In this work, we propose a new robustness benchmark to evaluate depth estimation systems under various noisy pose settings. Surprisingly, we find that current multi-view depth estimation methods and single-view and multi-view fusion methods fail when given noisy pose settings. To address this challenge, we propose a single-view and multi-view fused depth estimation system, which adaptively integrates high-confidence multi-view and single-view results for both robust and accurate depth estimation. The adaptive fusion module performs fusion by dynamically selecting high-confidence regions between the two branches based on a warping confidence map. Thus, the system tends to choose the more reliable branch when facing textureless scenes, inaccurate calibration, dynamic objects, and other degraded or challenging conditions. Our method outperforms state-of-the-art multi-view and fusion methods under robustness testing. Furthermore, we achieve state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when given accurate pose estimations. Project website: https://github.com/Junda24/AFNet/.
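
To make the idea of a noisy pose setting concrete, the following is a minimal sketch of how a ground-truth camera pose could be perturbed before it is fed to a multi-view method. It is an illustration only: the function names, the Gaussian noise model, and the parameters `rot_std_deg` and `trans_std` are assumptions made here, not the protocol of the benchmark described above.

```python
import math
import random


def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix for a unit axis and an angle in radians."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def perturb_pose(pose, rot_std_deg, trans_std, rng):
    """Return a noisy copy of a 4x4 pose given as a list of lists.

    Hypothetical noise model: a rotation about a random axis with a
    Gaussian-distributed angle, plus Gaussian noise on the translation.
    """
    # Draw a random rotation axis and normalize it to unit length.
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(a * a for a in axis))
    axis = [a / norm for a in axis] if norm > 0.0 else [1.0, 0.0, 0.0]

    angle = math.radians(rng.gauss(0.0, rot_std_deg))
    r_noise = rotation_from_axis_angle(axis, angle)
    t_noise = [rng.gauss(0.0, trans_std) for _ in range(3)]

    noisy = [row[:] for row in pose]  # the bottom row [0, 0, 0, 1] is kept as-is
    for i in range(3):
        for j in range(3):
            # Left-multiply the rotation block by the noise rotation.
            noisy[i][j] = sum(r_noise[i][k] * pose[k][j] for k in range(3))
        # Add the translation noise to the translation column.
        noisy[i][3] = pose[i][3] + t_noise[i]
    return noisy
```

Sweeping `rot_std_deg` and `trans_std` from zero upward and re-running a depth model on the perturbed poses gives one simple way to observe how quickly accuracy degrades as pose quality drops.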

Paper Structure (21 sections, 6 equations, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Visualization of reconstructed 3D point clouds of DDAD [godard2019digging] scenes. We fuse the results of 10 frames (including the dynamic cars) and zoom in on some details for visualization. Our method achieves high-quality results on both static and dynamic parts.
  • Figure 2: Overview of AFNet, which consists of three parts: the single-view branch, the multi-view branch, and the adaptive fusion (AF) module. The two branches share the feature extraction network, and each has its own prediction and confidence map, i.e. $\boldsymbol{d}_{s}$, $\boldsymbol{M}_{s}$, $\boldsymbol{d}_{m}$ and $\boldsymbol{M}_{m}$, which are then fused by the AF module to obtain the final accurate and robust prediction $\boldsymbol{d}_{fuse}$. The green background in the AF module marks the outputs of the single-view and multi-view branches.
  • Figure 3: Qualitative results on the DDAD [godard2019digging] test set. Black ellipses highlight obvious improvements achieved by our method.
  • Figure 4: Visual comparison on DDAD [godard2019digging]. The black boxes show the robustness of our AFNet. As the pose noise gradually increases, the accuracy of IterMVS [wang2022itermvs], which is mainly based on multi-view matching, decreases dramatically, while ours remains stable.
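
The data flow in Figure 2 can be illustrated with a small hand-written stand-in for the fusion step. The sketch below blends the two depth maps per pixel, weighting the single-view depth by $\boldsymbol{M}_{s}$ and the multi-view depth by the product of $\boldsymbol{M}_{m}$ and the warping confidence $\boldsymbol{M}_{w}$. This weighting rule, the function name `fuse_depth`, and the fallback to the single-view depth are assumptions made for illustration; they are not the AF module itself, whose internals are not described in this summary.

```python
def fuse_depth(d_s, m_s, d_m, m_m, m_w, eps=1e-6):
    """Per-pixel confidence-weighted blend of single-view and multi-view depth.

    All inputs are 2D lists of equal shape. Confidences are assumed to lie
    in [0, 1]. This is a hypothetical, hand-written rule for illustration.
    """
    fused = []
    for rows in zip(d_s, m_s, d_m, m_m, m_w):
        out_row = []
        for ds, ms, dm, mm, mw in zip(*rows):
            w_s = ms        # weight of the single-view depth
            w_m = mm * mw   # multi-view weight, gated by the warping confidence
            total = w_s + w_m
            if total < eps:
                # Neither branch is trusted: fall back to the single-view
                # depth, which does not depend on the camera pose.
                out_row.append(ds)
            else:
                out_row.append((w_s * ds + w_m * dm) / total)
        fused.append(out_row)
    return fused


# Two pixels: the warping confidence is high for the first and zero for the second.
d_fuse = fuse_depth(
    d_s=[[10.0, 10.0]], m_s=[[0.5, 0.5]],
    d_m=[[20.0, 20.0]], m_m=[[0.5, 0.5]],
    m_w=[[1.0, 0.0]],
)
print(d_fuse)  # [[15.0, 10.0]]
```

In the second pixel the warping confidence is zero, so the multi-view depth is ignored and the result equals the single-view depth. That mirrors the behaviour described for pose errors and dynamic objects, where the more reliable branch should dominate.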