
IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation

Xiao Yu, Yan Fang, Yao Zhao, Yunchao Wei

TL;DR

This work tackles semantic drift in class-incremental semantic segmentation (CISS) caused by separate optimization and noisy pseudo-labels. It proposes IPSeg, which integrates image posterior guidance to align optimization across incremental stages and permanent-temporary semantics decoupling to treat stable background/unknown semantics separately from dynamic foreground targets. Through extensive experiments on VOC 2012 and ADE20K, IPSeg achieves state-of-the-art performance, especially in long-term incremental scenarios, and demonstrates robustness to forgetting while efficiently leveraging memory buffers. The approach advances practical continual learning for pixel-wise tasks and highlights a trade-off between performance and memory/privacy considerations due to replay, with potential future work aiming to remove memory buffers.

Abstract

Class incremental learning aims to enable models to learn from sequential, non-stationary data streams across different tasks without catastrophic forgetting. In class incremental semantic segmentation (CISS), the semantic content of image pixels evolves over incremental phases, a phenomenon known as semantic drift. In this work, we identify two critical challenges in CISS that contribute to semantic drift and degrade performance. First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages, leading to misaligned probability scales. Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which leads to sub-optimal performance. To address these challenges, we propose a novel and effective approach, Image Posterior and Semantics Decoupling for Segmentation (IPSeg). IPSeg introduces two key mechanisms: (1) leveraging image posterior probabilities to align optimization across stages and mitigate the effects of separate optimization, and (2) employing semantics decoupling to handle noisy semantics and tailor learning strategies to different types of semantics. Extensive experiments on the Pascal VOC 2012 and ADE20K datasets demonstrate that IPSeg achieves superior performance compared to state-of-the-art methods, particularly in challenging long-term incremental scenarios.
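The separate-optimization problem and the effect of image posterior guidance can be illustrated with a toy calculation for a single pixel. This is a hypothetical sketch, not the authors' code. It assumes three things. First, the pixel-wise logits from classifier heads trained in different steps are concatenated and passed through a softmax. Second, a multi-label image classifier outputs an independent sigmoid probability per class. Third, the two are fused by multiplying each pixel-wise probability by the image-level posterior. The function names (softmax, sigmoid, predict_pixel) and all numbers are illustrative and mirror the horse/cow case in Figure 1. The paper's exact fusion rule may differ.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_pixel(pixel_logits: List[float],
                  image_logits: List[float],
                  classes: List[str]) -> Tuple[str, str]:
    """Return (label without IP guidance, label with IP guidance) for one pixel.

    pixel_logits: per-class logits, concatenated from heads trained in different steps.
    image_logits: per-class logits of a hypothetical multi-label image classifier.
    """
    pixel_probs = softmax(pixel_logits)
    naive = classes[max(range(len(classes)), key=lambda i: pixel_probs[i])]

    # Scale each pixel-wise probability by the image-level posterior p(c | image).
    # A class judged absent from the whole image is suppressed, even if its
    # separately optimized head emits a large logit at this pixel.
    guided_scores = [p * sigmoid(s) for p, s in zip(pixel_probs, image_logits)]
    guided = classes[max(range(len(classes)), key=lambda i: guided_scores[i])]
    return naive, guided


classes = ["background", "cow", "horse"]
# Illustrative logits: the step-1 "cow" head is over-confident on a horse pixel.
pixel_logits = [0.5, 6.0, 4.0]
# Illustrative image-level logits: horse present, cow absent.
image_logits = [2.0, -3.0, 3.0]
print(predict_pixel(pixel_logits, image_logits, classes))  # ('cow', 'horse')
```

With these numbers, the plain softmax picks "cow" (probability about 0.88), because the older head's logit scale dominates. After weighting by the image posterior, "horse" scores about 0.11 versus about 0.04 for "cow". When the image posterior is uniform across classes, the weighting leaves the pixel-wise argmax unchanged.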

Paper Structure

This paper contains 45 sections, 9 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: (a) Because of separate optimization, the previous method SSUL-M misclassifies a "horse" as a "cow", giving "cow" a higher logit score, when "horse" is learned after "cow". In contrast, our IPSeg leverages image posterior (IP) guidance to produce accurate predictions on these two similar-looking classes. The "logit scores" refer to pixel-wise predictions, and the image posterior refers to our introduced image-wise prediction. The logit values shown are illustrative. (b) Quantitative performance comparison with state-of-the-art methods under the long-term incremental challenge (VOC 2-2).
  • Figure 2: Overall architecture of our proposed IPSeg, composed mainly of two parts: image posterior and permanent-temporary semantics decoupling. In the latter, $\phi_p$ denotes the permanent learning branch and $\phi_1, \phi_2, ..., \phi_t$ denote the temporary ones. Black solid lines indicate the data flow during training, and green lines indicate the data flow during inference.
  • Figure 3: (a) The overall performance of different methods on Pascal VOC 2012 under 4 scenarios, (b) mIoU visualization on Pascal VOC 2012 2-2, (c) mIoU visualization on Pascal VOC 2012 15-1.
  • Figure 4: The visualization of separate optimization. Sequence A: first learn "cow", "bus", "sofa" in step 1, then "horse", "car", "chair" in step 2. Sequence B: first learn "horse", "car", "chair" in step 1, then "cow", "bus", "sofa" in step 2.
  • Figure 5: Probability distributions of SSUL-M, IPSeg, and Joint-Training (Joint) in regions of incorrect predictions. Class indices "10" and "17" represent "cow" and "sheep", respectively.
  • ...and 4 more figures
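The permanent-temporary split described for Figure 2 can be sketched as an inference-time merge for a single pixel. This is a hypothetical illustration, not the authors' implementation, and it rests on three assumptions. First, the permanent branch $\phi_p$ outputs probabilities over three stable categories ("background", "unknown", "foreground"). Second, each temporary head $\phi_1, ..., \phi_t$ outputs probabilities only over the classes added in its own step. Third, a pixel receives a foreground class only when $\phi_p$ marks it as foreground. The function name decoupled_predict, the label names and all numbers are illustrative. The actual method may also weight the temporary-head scores with the image posterior, which is omitted here.

```python
from typing import Dict, List

# Hypothetical label names for the stable semantics handled by the permanent branch.
STABLE_LABELS = ("background", "unknown")


def decoupled_predict(permanent_probs: Dict[str, float],
                      temporary_probs: List[Dict[str, float]]) -> str:
    """Merge one pixel's permanent-branch and temporary-head outputs.

    permanent_probs: assumed output of phi_p over
        {"background", "unknown", "foreground"}.
    temporary_probs: one dict per incremental step (phi_1 .. phi_t), mapping
        only the classes learned in that step to their probabilities.
    """
    # Stable semantics (background / unknown) are decided by phi_p alone,
    # so later steps' pseudo-labels cannot re-label them.
    stable = max(permanent_probs, key=permanent_probs.get)
    if stable in STABLE_LABELS:
        return stable

    # Otherwise the pixel is foreground: pick the highest-scoring class
    # across all temporary heads, old and new.
    best_label, best_score = None, float("-inf")
    for head in temporary_probs:
        for label, prob in head.items():
            if prob > best_score:
                best_label, best_score = label, prob
    # Fall back to "unknown" if no temporary head exists yet.
    return best_label if best_label is not None else "unknown"


permanent = {"background": 0.1, "unknown": 0.2, "foreground": 0.7}
temporary = [
    {"cow": 0.30, "bus": 0.10, "sofa": 0.05},     # phi_1 (step 1 classes)
    {"horse": 0.60, "car": 0.05, "chair": 0.02},  # phi_2 (step 2 classes)
]
print(decoupled_predict(permanent, temporary))  # horse
```

Keeping background and unknown in a single branch that is trained across all steps is one way to prevent each step's pseudo-labels from repeatedly redefining those regions. Only the foreground decision is spread across the per-step heads.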