ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer
Bowen Dong, Guanglei Yang, Wangmeng Zuo, Lei Zhang
TL;DR
ConSept is an adapter-based Vision Transformer (ViT) framework for continual semantic segmentation that preserves performance on old classes while learning to segment novel classes. It inserts lightweight attention-based adapters into a pretrained ViT and uses a simple linear segmentation head, achieving competitive or state-of-the-art results without a heavy decoder. To further reduce catastrophic forgetting, it applies distillation with a deterministic old-class boundary, and it regularizes the segmentation maps with dual dice losses computed on pseudo-ground-truth maps. Experiments on PASCAL VOC and ADE20K support the effectiveness and efficiency of ConSept as a ViT-based baseline for continual segmentation with a favorable balance between old-class and new-class performance.
Abstract
In this paper, we study vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on adapting existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs, or fine-tuning ViTs with distillation terms, improves segmentation of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of a ViT with a linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance segmentation of both old and new classes, we propose two key strategies: distillation with a deterministic old-class boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at https://github.com/DongSky/ConSept.
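The abstract names dice losses as a regularizer but does not define the two terms. As a rough illustration only, the sketch below shows a generic soft dice loss and a per-class average of it in plain Python. The function names, the smoothing constant `eps`, and the flat-list input layout are assumptions made for this example; it is not the dual formulation used by ConSept.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Generic soft dice loss for one class (illustrative, not ConSept's exact loss).

    pred:   per-pixel probabilities in [0, 1], as a flat list
    target: per-pixel binary mask (e.g. from a pseudo-ground-truth map), as a flat list
    eps:    assumed smoothing constant that avoids division by zero
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice


def multiclass_dice_loss(probs, labels, num_classes):
    """Average the soft dice loss over classes.

    probs:  list over pixels, each entry a list of per-class probabilities
    labels: list over pixels of integer class indices
    """
    losses = []
    for c in range(num_classes):
        pred_c = [p[c] for p in probs]
        target_c = [1.0 if y == c else 0.0 for y in labels]
        losses.append(soft_dice_loss(pred_c, target_c))
    return sum(losses) / num_classes
```

A prediction that matches the mask exactly gives a loss near 0, and a prediction with no overlap gives a loss near 1, which is why the term rewards region-level agreement rather than per-pixel accuracy alone.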
