Continual Road-Scene Semantic Segmentation via Feature-Aligned Symmetric Multi-Modal Network

Francesco Barbato, Elena Camuffo, Simone Milani, Pietro Zanuttigh

TL;DR

This work re-frames the task of multimodal semantic segmentation by enforcing a tightly coupled feature representation and a symmetric information-sharing scheme, which allows the approach to work even when one of the input modalities is missing.

Abstract

State-of-the-art multimodal semantic segmentation strategies combining LiDAR and color data are usually designed on top of asymmetric information-sharing schemes and assume that both modalities are always available. This strong assumption may not hold in real-world scenarios, where sensors are prone to failure or can face adverse conditions that make the acquired information unreliable. This problem is exacerbated when continual learning scenarios are considered since they have stringent data reliability constraints. In this work, we re-frame the task of multimodal semantic segmentation by enforcing a tightly coupled feature representation and a symmetric information-sharing scheme, which allows our approach to work even when one of the input modalities is missing. We also introduce an ad-hoc class-incremental continual learning scheme, proving our approach's effectiveness and reliability even in safety-critical settings, such as autonomous driving. We evaluate our approach on the SemanticKITTI dataset, achieving impressive performance.

Paper Structure (9 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: We introduce a new symmetric multimodal architecture that, unlike competing approaches (zhuang2021perceptionaware), works reliably even when one modality is missing. Our regularization strategies also allow a natural extension to continual learning scenarios, where we relax the requirement that both modalities be available at the same time.
  • Figure 2: Overall architecture of our method, which takes an RGB sample and a LiDAR sample as input to separate branches, aligns features at multiple levels, and then combines the outputs of the two branches into a single joint segmentation map. Catastrophic forgetting in continual learning is tackled via knowledge distillation losses, which let each modality benefit from information shared by the other.
  • Figure 3: Example of the inpainting process: the unknown-class pixels (black) of the input image are filled in using the network's prediction, yielding the inpainted output image. Points are expanded with a circular kernel for clarity; their real size is one pixel.
  • Figure 5: Qualitative results. The top rows contain the image-branch predictions; the bottom rows contain the LiDAR-branch predictions.
  • Figure : Average mIoU drop ($\downarrow$) in the incremental steps. Per-class drop is computed by dividing the average drop by the number of incremental classes in the incremental steps.
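
Two of the captions above describe small computations: filling unknown-class pixels with the network's prediction (Figure 3), and dividing the average mIoU drop by the number of incremental classes. The sketch below illustrates both in plain Python. It is not the authors' implementation: the function names, the choice of `0` as the unknown-class id, the assumption that the inpainted image is a label map, and the numbers in the usage note are all hypothetical and used only for illustration.

```python
# Illustrative sketch only; names and the unknown-class id are assumptions,
# not taken from the paper.

UNKNOWN_ID = 0  # hypothetical id marking "unknown class" (black) pixels


def inpaint_labels(labels, prediction, unknown_id=UNKNOWN_ID):
    """Fill unknown-class pixels of a label map with the predicted class.

    `labels` and `prediction` are 2D lists (rows of integer class ids)
    with identical shape. Known labels are left untouched.
    """
    if len(labels) != len(prediction) or any(
        len(lab_row) != len(pred_row)
        for lab_row, pred_row in zip(labels, prediction)
    ):
        raise ValueError("labels and prediction must have the same shape")
    return [
        [pred if lab == unknown_id else lab
         for lab, pred in zip(lab_row, pred_row)]
        for lab_row, pred_row in zip(labels, prediction)
    ]


def per_class_drop(average_drop, num_incremental_classes):
    """Divide the average mIoU drop by the number of incremental classes."""
    if num_incremental_classes <= 0:
        raise ValueError("num_incremental_classes must be positive")
    return average_drop / num_incremental_classes
```

As a usage example with made-up values, `inpaint_labels([[0, 3], [5, 0]], [[7, 8], [9, 4]])` keeps the known labels `3` and `5` and replaces the two unknown pixels with the predictions `7` and `4`, while `per_class_drop(6.0, 3)` spreads a 6-point average drop over three incremental classes.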