OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation
Dongjun Hwang, Yejin Kim, Minyoung Lee, Seong Joon Oh, Junsuk Choe
TL;DR
This work tackles Open-Vocabulary Segmentation (OVS) under sequential data collection, where new datasets arrive over time. It introduces ConOVS, a Mixture-of-Experts continual learning framework that builds one expert per incremental dataset and, at inference, dynamically interpolates decoder weights according to how close each input lies to every dataset's distribution, modeled as a multivariate normal (MVN). The method consistently improves over retraining, fine-tuning, and existing continual learning baselines across pre-training, incremental, and zero-shot evaluations, and scales to multiple incremental datasets with reasonable resources. By enabling per-sample expert fusion without full retraining, ConOVS provides a sustainable approach to expanding OVS capabilities in real-world, continually evolving environments.
Abstract
Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. We first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches, namely retraining, fine-tuning, and continual learning, and find that each has clear limitations. To address these limitations, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.
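The per-sample expert fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: this section does not give the exact weighting rule, so the softmax over MVN log-densities, the use of a generic embedding vector as input, and all function names (`fit_mvn`, `expert_weights`, `interpolate_decoders`) are assumptions made for illustration only.

```python
import numpy as np

def fit_mvn(embeddings, eps=1e-3):
    """Estimate the mean and a regularized covariance of one dataset's embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + eps * np.eye(embeddings.shape[1])
    return mean, cov

def mvn_log_density(x, mean, cov):
    """Log-density of x under a multivariate normal N(mean, cov)."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def expert_weights(x, mvns):
    """Assumed rule: softmax over per-dataset log-densities (numerically stable)."""
    log_p = np.array([mvn_log_density(x, m, c) for m, c in mvns])
    w = np.exp(log_p - log_p.max())
    return w / w.sum()

def interpolate_decoders(weights, decoders):
    """Weighted average of decoder parameters, one parameter dict per expert."""
    return {
        name: sum(w * dec[name] for w, dec in zip(weights, decoders))
        for name in decoders[0]
    }

# Toy example: two datasets with well-separated embedding distributions.
rng = np.random.default_rng(0)
emb_a = rng.normal(loc=0.0, size=(200, 4))
emb_b = rng.normal(loc=5.0, size=(200, 4))
mvns = [fit_mvn(emb_a), fit_mvn(emb_b)]

# Hypothetical decoder parameters for the two experts.
decoders = [{"head": np.zeros((2, 2))}, {"head": np.ones((2, 2))}]

x = np.full(4, 5.0)                      # input embedding close to dataset B
w = expert_weights(x, mvns)              # nearly all weight goes to expert B
fused = interpolate_decoders(w, decoders)
```

Because the weights are computed per input, a sample near one dataset's distribution is decoded almost entirely by that dataset's expert, while an ambiguous sample receives a blend of experts. In a real model each dictionary would hold the full set of decoder tensors.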
