ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer

Bowen Dong, Guanglei Yang, Wangmeng Zuo, Lei Zhang

TL;DR

ConSept introduces an adapter-based Vision Transformer (ViT) framework for continual semantic segmentation that maintains strong performance on old classes while enabling robust generalization to novel classes. By inserting lightweight attention-based adapters into a pretrained ViT and using a simple linear segmentation head, ConSept achieves competitive or state-of-the-art results without heavy decoders. The method further enhances anti-catastrophic forgetting via distillation with a deterministic old-class boundary and regularizes segmentation with dual dice losses computed on pseudo-ground-truth maps. Extensive experiments on PASCAL VOC and ADE20K demonstrate the effectiveness and efficiency of ConSept, establishing a solid ViT-based baseline for continual segmentation with strong old-vs-new class performance trade-offs.

Abstract

In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation capability of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of ViT with linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance the segmentation capabilities for both old and new classes, we propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at https://github.com/DongSky/ConSept.
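
To make the dice-loss regularization mentioned in the abstract more concrete, the following is a minimal sketch of a generic soft Dice loss on flattened per-class maps. It is not the authors' implementation: the exact form of the two dice terms is not given in this summary, and the function names `soft_dice_loss` and `mean_dice_loss`, the smoothing constant `eps`, and the plain-list data layout are all assumptions made for illustration. The target may be a ground-truth or pseudo-ground-truth mask.

```python
from typing import Sequence


def soft_dice_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1.0) -> float:
    """Generic soft Dice loss for one class over flattened pixels.

    pred   : predicted probabilities in [0, 1], one per pixel
    target : binary (pseudo-)ground-truth mask, one value per pixel
    eps    : smoothing constant (an assumed value) that avoids division by zero
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def mean_dice_loss(
    pred_maps: Sequence[Sequence[float]],
    target_maps: Sequence[Sequence[float]],
    eps: float = 1.0,
) -> float:
    """Average the per-class soft Dice loss over a list of class maps."""
    if len(pred_maps) != len(target_maps) or not pred_maps:
        raise ValueError("need the same, non-zero number of class maps")
    losses = [soft_dice_loss(p, t, eps) for p, t in zip(pred_maps, target_maps)]
    return sum(losses) / len(losses)


# Hypothetical 4-pixel example: class 0 is predicted perfectly, class 1 not at all.
perfect = soft_dice_loss([1, 0, 1, 0], [1, 0, 1, 0])
disjoint = soft_dice_loss([1, 1, 0, 0], [0, 0, 1, 1])
average = mean_dice_loss([[1, 0, 1, 0], [1, 1, 0, 0]], [[1, 0, 1, 0], [0, 0, 1, 1]])
```

The loss is lowest when the predicted and target regions overlap fully and grows as they diverge, which is why a Dice term is commonly used to regularize whole segmentation maps rather than individual pixels.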

Paper Structure (20 sections, 14 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Performance comparison between ConSept and state-of-the-art methods [cermelli2020modeling, zhang2022microseg, shang2023incrementer]. ConSept obtains the best performance on all PASCAL VOC benchmarks under the overlapped setting. Best viewed in color.
  • Figure 2: Overview of our proposed ConSept. The pipeline is primarily grounded on SSUL [cha2021ssul], replacing the segmentation head with a vanilla ViT accompanied by a linear head. To fully harness the anti-catastrophic forgetting capability of ViT and enhance generalization in continual segmentation scenarios, we integrate adapters into the ViT, resulting in a dual-path feature extractor trained with full fine-tuning, which is the key element of ConSept. Additionally, ConSept employs feature distillation with a frozen old-class linear head to enhance its anti-catastrophic forgetting ability, and incorporates dual dice losses to regularize the segmentation maps for better overall segmentation performance.
  • Figure 3: Visual comparison between previous state-of-the-art methods (i.e., SSUL [cha2021ssul] and MicroSeg [zhang2022microseg]) and ConSept on the PASCAL VOC [everingham2010pascal] 15-1 benchmark under the overlapped setting. Our method performs the best on both base and novel classes.
  • Figure 4: Visualization of predictions from ConSept on the PASCAL VOC 15-1 task under the overlapped setting. ConSept exhibits stable anti-catastrophic forgetting ability for old classes and good generalization ability for novel classes.
  • Figure 5: Visualization of ConSept on the ADE20K 100-10 task under the overlapped setting. ConSept performs well on more challenging tasks.