SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images

Pardis Taghavi, Reza Langari, Gaurav Pandey

TL;DR

The paper presents a multi-task learning framework that performs depth estimation and semantic segmentation concurrently from a single camera. It is built on a shared encoder-decoder architecture and integrates several techniques to improve accuracy on both tasks without compromising computational efficiency.

Abstract

This research paper presents a multi-task learning framework that performs depth estimation and semantic segmentation concurrently using a single camera. The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of both tasks without compromising computational efficiency. Additionally, the paper incorporates an adversarial training component, employing a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is evaluated on two datasets, the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset, and it outperforms existing state-of-the-art methods in both segmentation and depth estimation. Ablation studies analyze the contributions of individual components, including pre-training strategies, the inclusion of critics, the use of logarithmic depth scaling, and advanced image augmentations. The accompanying source code is available at https://github.com/PardisTaghavi/SwinMTL.
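
As a rough illustration of the adversarial component mentioned in the abstract, the sketch below computes the standard Wasserstein GAN objectives from critic scores. It is a minimal, framework-free sketch and not the authors' implementation: the function names are hypothetical, the scores are plain Python floats, and any Lipschitz constraint on the critic (weight clipping or a gradient penalty) is omitted because the abstract does not say which one is used.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic objective (to be minimized).

    The critic is trained to score ground-truth maps higher than
    predicted maps, so the loss is mean(fake) - mean(real).
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_adversarial_loss(fake_scores):
    """Standard WGAN generator objective (to be minimized).

    The prediction network is rewarded when the critic scores its
    outputs highly, so the loss is -mean(fake).
    """
    return -fmean(fake_scores)


# Hypothetical critic scores for a batch of two samples.
real = [1.0, 3.0]   # scores assigned to ground-truth maps
fake = [0.0, 1.0]   # scores assigned to predicted maps

loss_critic = critic_loss(real, fake)
loss_generator = generator_adversarial_loss(fake)
```

In a multi-task setting, an adversarial term of this kind would typically be added, with some weight, to the supervised depth and segmentation losses; the weighting used in the paper is not stated here.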

Paper Structure (12 sections, 10 equations, 5 figures, 5 tables)

This paper contains 12 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Unified Vision: This figure provides a general overview of the seamless integration designed for joint semantic segmentation and depth estimation, ultimately contributing to the creation of a 3D scene.
  • Figure 2: Network Architecture: The figure shows the detailed network architecture for predicting both depth maps and segmentation maps. Notably, the network employs two critic discriminators during training, which help refine the joint depth and segmentation output.
  • Figure 3: The first row shows a stereo pair from the Cityscapes dataset. The second row illustrates the dense disparity map (bottom right) and the dense depth map (bottom left), both generated using the Cascaded Recurrent Stereo Matching Network (CREStereo; Li et al., 2022).
  • Figure 4: Histogram of depth values of an instance in the Cityscapes dataset, visually highlighting the dominance of values within the 0 to 10 meter range. The distribution's peak and long tail motivate the logarithmic transformation.
  • Figure 5: Exploring SwinMTL's Qualitative Results on the Cityscapes Dataset
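
The captions for Figures 3 and 4 touch on two small computations: turning a stereo disparity map into depth, and applying a logarithmic transformation to depth values. The sketch below illustrates both under stated assumptions. The disparity-to-depth relation is the standard pinhole-stereo formula, the focal length and baseline are hypothetical round numbers rather than the Cityscapes calibration, and the exact form of the paper's logarithmic scaling is not given here, so a plain natural log with a small lower clamp is assumed.

```python
import math


def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Standard stereo relation: depth = focal * baseline / disparity.

    eps guards against division by zero for invalid (zero) disparities.
    """
    return focal_px * baseline_m / max(disparity_px, eps)


def log_scale_depth(depth_m, min_depth=1e-3):
    """Assumed log transform: natural log of depth, clamped from below."""
    return math.log(max(depth_m, min_depth))


def inverse_log_scale(log_depth):
    """Map a log-scaled prediction back to meters."""
    return math.exp(log_depth)


# Hypothetical camera parameters (not the actual Cityscapes calibration).
FOCAL_PX = 2000.0
BASELINE_M = 0.2

depth_m = disparity_to_depth(40.0, FOCAL_PX, BASELINE_M)  # about 10 m

# In log space, the 1 m to 10 m interval is as wide as 10 m to 100 m,
# so the sparse long tail no longer dominates the target range.
near_span = log_scale_depth(10.0) - log_scale_depth(1.0)
far_span = log_scale_depth(100.0) - log_scale_depth(10.0)
```

The design point is that a regression loss on raw depth weights a 1 m error at 90 m the same as a 1 m error at 3 m, whereas a loss on log depth measures relative error, which suits a distribution concentrated at short range with a long tail.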