A Concise but High-performing Network for Image Guided Depth Completion in Autonomous Driving

Moyun Liu, Bing Chen, Youping Chen, Jingming Xie, Lei Yao, Yang Zhang, Joey Tianyi Zhou

TL;DR

This work tackles real-time RGB-guided depth completion for autonomous driving by introducing CHNet, a lightweight dual-encoder, single-decoder network. It features a fast guidance module that efficiently fuses RGB semantics into depth features and a decoupled prediction head that separately optimizes observed and unobserved pixel regions to mitigate optimization mismatch. Empirical results on KITTI and NYUv2 demonstrate strong RMSE/MAE performance with superior inference speed compared to many state-of-the-art methods. The approach yields a practical, high-accuracy depth completion solution suitable for real-time autonomous perception.

Abstract

Depth completion is a crucial task in autonomous driving, aiming to convert a sparse depth map into a dense depth prediction. Because of its potentially rich semantic information, an RGB image is commonly fused to improve completion quality. Image-guided depth completion involves three key challenges: 1) how to effectively fuse the two modalities; 2) how to better recover depth information; and 3) how to achieve real-time prediction for practical autonomous driving. To solve these problems, we propose a concise but effective network, named CHNet, to achieve high-performance depth completion with a simple and elegant structure. First, we use a fast guidance module to fuse the two sensor features, utilizing abundant auxiliary features extracted from the color space. Unlike other commonly used complicated guidance modules, our approach is intuitive and low-cost. In addition, we identify and analyze the optimization inconsistency problem between observed and unobserved positions, and propose a decoupled depth prediction head to alleviate it. The proposed decoupled head can better output the depth of valid and invalid positions with very little extra inference time. Based on the simple dual-encoder, single-decoder structure, CHNet achieves a superior balance between accuracy and efficiency. On the KITTI depth completion benchmark, CHNet attains competitive performance and inference speed compared with state-of-the-art methods. To validate the generalization of our method, we also evaluate on the indoor NYUv2 dataset, where CHNet still achieves impressive results. The code of this work will be available at https://github.com/lmomoy/CHNet.
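
The abstract does not say how the two outputs of the decoupled head are combined into one dense map. As a minimal sketch, assuming that a pixel counts as observed when its sparse depth is greater than zero and that each branch is simply selected per pixel by that mask, the merge could look like the following. The function name `merge_decoupled` and the toy values are hypothetical, not taken from the paper or its repository.

```python
from typing import List

Grid = List[List[float]]


def merge_decoupled(sparse: Grid, pred_observed: Grid, pred_unobserved: Grid) -> Grid:
    """Select, per pixel, the branch that matches its observation status.

    Assumption: sparse depth > 0 marks an observed (valid) position.
    """
    merged: Grid = []
    for s_row, o_row, u_row in zip(sparse, pred_observed, pred_unobserved):
        # Observed pixels take the observed-branch value, all others the unobserved one.
        merged.append([o if s > 0 else u for s, o, u in zip(s_row, o_row, u_row)])
    return merged


# Toy 2x2 example: only the top-right pixel has a LiDAR measurement.
sparse = [[0.0, 5.2], [0.0, 0.0]]
pred_obs = [[9.0, 5.1], [9.0, 9.0]]
pred_unobs = [[4.8, 7.0], [6.3, 6.9]]
dense = merge_decoupled(sparse, pred_obs, pred_unobs)
print(dense)
```

In this sketch each branch only needs to be accurate on its own set of positions, which is the intuition behind decoupling the prediction head.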

Paper Structure (17 sections, 9 equations, 10 figures, 9 tables, 1 algorithm)

Figures (10)

  • Figure 1: (a) and (b) represent the RGB image and the corresponding sparse depth map, respectively. Compared with the RGB image, the sparse depth map scanned by LiDAR is too sparse to reflect the surroundings.
  • Figure 2: Compared with other methods, which are marked in blue, our CHNet achieves superior accuracy and efficiency on the KITTI depth completion benchmark (uhrig2017sparsity). More comparison details can be found in Table \ref{tab:sota}. Note that smaller values of root mean squared error (RMSE) and inference time indicate better performance.
  • Figure 3: The structure of our CHNet contains two encoders for different modalities and one decoder that outputs the prediction. A decoupled prediction head is connected to the decoder to obtain depth for unobserved and observed positions, respectively. There are skip connections between each output of the depth branch in the encoder and the corresponding features in the decoder; they are omitted from this figure for simplicity. Different sizes and shapes of blocks correspond to variations in the size and channels of the feature maps they operate on.
  • Figure 4: The illustration of our fast guidance module.
  • Figure 5: Illustration of the optimization inconsistency problem for unobserved and observed positions. Gray denotes unobserved positions, and the intensity of orange represents the depth value. (a) shows that traditional networks assign the same parameters to all positions, whether they are observed or not. (b) shows that the optimization inconsistency leads to suboptimal predictions. (c) demonstrates the effectiveness of the decoupled prediction head.
  • ...and 5 more figures
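
Figure 2 and the TL;DR compare methods by RMSE and MAE. As a rough illustration of how these two metrics are typically computed for depth completion, the sketch below averages errors only over pixels that have ground truth, here assumed to be those with ground-truth depth greater than zero. The function name `rmse_mae` and the toy depth values are hypothetical.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def rmse_mae(pred: Grid, gt: Grid) -> Tuple[float, float]:
    """Return (RMSE, MAE) over pixels with valid ground truth.

    Assumption: ground-truth depth > 0 marks a valid pixel.
    """
    errors = [
        p - g
        for p_row, g_row in zip(pred, gt)
        for p, g in zip(p_row, g_row)
        if g > 0
    ]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Toy 2x2 example: the top-right pixel has no ground truth and is skipped.
pred = [[10.0, 20.0], [30.0, 40.0]]
gt = [[13.0, 0.0], [26.0, 40.0]]
rmse, mae = rmse_mae(pred, gt)
print(rmse, mae)
```

Because RMSE squares each error before averaging, it penalizes large outliers more heavily than MAE, so the two numbers can rank methods differently.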