BACS: Background Aware Continual Semantic Segmentation
Mostafa ElAraby, Ali Harakeh, Liam Paull
TL;DR
The paper tackles continual semantic segmentation (CSS) under background shift by introducing BACS, a backward background shift detector that uses latent-space foreground prototypes to separate pixels of previously learned classes from true background. It couples this detector with a two-part, background-shift-aware loss, masked knowledge distillation, and dark experience replay, and replaces separate per-step classification heads with a transformer-based decoder that appends new class tokens as needed. Empirical results on Pascal VOC 2012 and Cityscapes in the challenging overlap mode show that BACS achieves better plasticity and stability than state-of-the-art baselines and is robust to class ordering and initialization strategy. The work offers a practical, annotation-efficient approach to scalable CSS in robotics and autonomous systems, allowing new classes to be added with limited memory and compute.
Abstract
Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there's a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the common problem of classical catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift", since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.
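The two ideas at the core of the abstract, detecting old-class pixels by their latent distance to foreground centroids and down-weighting them in the cross-entropy loss, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, the exponential distance kernel, the temperature `tau`, and the `1 - score` weighting are all assumptions made for illustration; the abstract only states that distance to previous-step centroids drives the detector and that the detector down-weights background pixels.

```python
import numpy as np

def backward_shift_scores(features, prototypes, tau=1.0):
    """Score how likely each pixel belongs to a previously learned class.

    features:   (N, D) pixel embeddings from the latent space.
    prototypes: (K, D) foreground centroids stored at previous steps.
    Returns an (N,) array in (0, 1]; 1 means the pixel sits on a prototype.
    The exp(-d^2 / tau) kernel is an assumed choice, not taken from the paper.
    """
    diff = features[:, None, :] - prototypes[None, :, :]   # (N, K, D)
    sq_dist = (diff ** 2).sum(axis=-1)                      # (N, K)
    return np.exp(-sq_dist.min(axis=1) / tau)               # nearest prototype

def background_aware_ce(logits, labels, scores, background_id=0):
    """Cross-entropy in which background pixels are down-weighted.

    logits: (N, C) class scores, labels: (N,) integer labels,
    scores: (N,) output of backward_shift_scores.
    Pixels labeled background get weight (1 - score) (assumed weighting);
    all other pixels keep weight 1.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    weights = np.where(labels == background_id, 1.0 - scores, 1.0)
    return float((weights * nll).mean())
```

In this sketch, a background-labeled pixel whose embedding coincides with an old-class centroid contributes nothing to the loss, so the model is not pushed to relabel a previously learned class as background, while a background pixel far from every centroid is trained on as usual.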
