BACS: Background Aware Continual Semantic Segmentation

Mostafa ElAraby, Ali Harakeh, Liam Paull

TL;DR

The paper tackles continual semantic segmentation (CSS) under background shift by introducing BACS, a backward background shift detector that uses latent-space foreground prototypes to separate pixels of previously learned classes from true background. It couples this detector with a two-part, background-shift-aware loss, masked knowledge distillation, and dark experience replay, and it replaces separate per-step heads with a transformer-based decoder that appends new class tokens as needed. Experiments on Pascal VOC 2012 and Cityscapes in the challenging overlap mode show that BACS outperforms state-of-the-art baselines in both plasticity (learning new classes) and stability (retaining old ones), and that it is robust to class ordering and initialization strategy. The work offers a practical, annotation-efficient approach to scalable CSS in robotics and autonomous systems, allowing new classes to be added with limited memory and compute.

Abstract

Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there's a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the common problem of classical catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift", since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.
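
The down-weighting idea described in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the exponential distance kernel, the `temperature` parameter, the function names, and the choice to average over all pixels are assumptions made purely for illustration. The sketch scores each pixel by its distance to the foreground centroids of previous steps and shrinks the cross-entropy contribution of background-labeled pixels that look like old classes.

```python
import math

def foreground_prob(feature, centroid, temperature=1.0):
    """Hypothetical score in (0, 1]: 1 at the centroid, decaying with squared distance."""
    sq_dist = sum((f - c) ** 2 for f, c in zip(feature, centroid))
    return math.exp(-sq_dist / temperature)

def background_weight(feature, old_centroids, temperature=1.0):
    """Weight for a background-labeled pixel: 1 minus the max score over previous steps."""
    if not old_centroids:
        return 1.0  # first step: nothing to confuse the background with
    return 1.0 - max(foreground_prob(feature, c, temperature) for c in old_centroids)

def weighted_cross_entropy(probs, labels, features, old_centroids, background_id=0):
    """Per-pixel cross-entropy, averaged over pixels, with down-weighted background."""
    total = 0.0
    for p, y, f in zip(probs, labels, features):
        w = background_weight(f, old_centroids) if y == background_id else 1.0
        total += -w * math.log(p[y])
    return total / len(labels)

# Toy data (assumed): one centroid from a previous step, two background-labeled pixels.
old_centroids = [[1.0, 0.0]]
features = [[1.0, 0.0], [-3.0, 4.0]]  # the first pixel sits exactly on the old centroid
labels = [0, 0]
probs = [[0.5, 0.5], [0.5, 0.5]]

weights = [background_weight(f, old_centroids) for f in features]
loss = weighted_cross_entropy(probs, labels, features, old_centroids)
```

In this toy case the first pixel receives a weight of zero, so labeling it as background cannot push an old class toward the background, while the second pixel, far from every old centroid, keeps essentially full weight.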

Paper Structure (22 sections, 7 equations, 4 figures, 4 tables)

This paper contains 22 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: BACS framework overview. A backward background detector $b_{\omega}^t$ down-weights background pixels that are detected as belonging to classes from previous steps, preventing old classes from collapsing into the background.
  • Figure 2: Our continual learning framework, BACS, consists of the backward background detector (shown in blue) and a transformer decoder. The backward background detector compares each pixel's latent representation with a per-step centroid to detect foreground. The maximum output probability over all detector heads, $\max{Fg^{1:t-1}}$, is used to down-weight pixels that belong to old classes but are labeled as background in the step-$t$ ground truth. The transformer decoder then accommodates new classes by initializing new class tokens.
  • Figure 3: mIoU evaluation of BACS, MiB, and PLOP across 10 different class orderings.
  • Figure 4: Qualitative comparison of BACS, MiB, and PLOP on the 15-1 VOC setup. Left column: predictions after learning two tasks, not including the upcoming sofa class. Middle column: predictions after adding the sofa class. Right column: predictions at the end of training.
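
The class-token mechanism mentioned in the Figure 2 caption can be sketched as follows. The decoder internals are not given in this summary, so everything here is an assumption for illustration: the `TokenDecoder` class, the dot-product scoring between class tokens and a pixel embedding, the Gaussian initialization, and the sizes are all hypothetical. The point is only that adding a class means appending a token, with no new classification head and no change to existing tokens.

```python
import random

class TokenDecoder:
    """Toy decoder (hypothetical): one token per class; logits are token-pixel dot products."""

    def __init__(self, dim, num_classes, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.tokens = [self._new_token() for _ in range(num_classes)]

    def _new_token(self):
        # Small random initialization; the paper's initialization strategy may differ.
        return [self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]

    def add_classes(self, count):
        """Append tokens for new classes; existing tokens are left untouched."""
        self.tokens.extend(self._new_token() for _ in range(count))

    def logits(self, pixel_embedding):
        """One score per class token for a single pixel embedding."""
        return [sum(t * x for t, x in zip(token, pixel_embedding)) for token in self.tokens]

# Assumed sizes: 16 initial tokens, then one new class added in a later step.
decoder = TokenDecoder(dim=4, num_classes=16)
old_tokens = [list(t) for t in decoder.tokens]
decoder.add_classes(1)
scores = decoder.logits([0.1, -0.2, 0.3, 0.4])
```

After `add_classes(1)` the decoder produces 17 scores per pixel, and the first 16 tokens are identical to what they were before the new class was added.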