OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation

Dongjun Hwang, Yejin Kim, Minyoung Lee, Seong Joon Oh, Junsuk Choe

TL;DR

This work tackles Open-Vocabulary Segmentation (OVS) under sequential data collection, where new datasets arrive over time. It introduces ConOVS, a Mixture-of-Experts continual learning framework that builds one expert per incremental dataset and, at inference, dynamically interpolates decoder weights according to the input's proximity to each dataset’s distribution, modeled as a multivariate normal (MVN). The method improves consistently over retraining, fine-tuning, and existing continual learning baselines across pre-training, incremental, and zero-shot evaluations, and scales to multiple incremental datasets with reasonable resources. By enabling per-sample expert fusion without full retraining, ConOVS provides a sustainable approach to expanding OVS capabilities in real-world, continually evolving environments.

Abstract

Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. To address this, we first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches such as retraining, fine-tuning, and continual learning but find that each of them has clear limitations. To address these issues, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.
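The per-sample expert fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each dataset's image embeddings are summarized by a Gaussian with diagonal covariance (a simplification of a full MVN), that the mixing weights are a softmax over log-densities, and that decoder parameters are plain floats. The names `fit_diag_gaussian`, `mixing_weights`, and `interpolate_decoders` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Gaussian = Tuple[List[float], List[float]]  # (mean, variance) per dimension


def fit_diag_gaussian(embeddings: Sequence[Sequence[float]]) -> Gaussian:
    """Estimate a diagonal-covariance Gaussian from a dataset's embeddings."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n + 1e-6
           for d in range(dim)]
    return mean, var


def log_density(x: Sequence[float], gaussian: Gaussian) -> float:
    """Log-density of x under a diagonal-covariance Gaussian."""
    mean, var = gaussian
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def mixing_weights(x: Sequence[float], gaussians: Sequence[Gaussian]) -> List[float]:
    """Softmax over log-densities (an assumed normalization): one weight per dataset."""
    logs = [log_density(x, g) for g in gaussians]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_decoders(weights: Sequence[float],
                         decoders: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Weighted average of decoder parameters, parameter by parameter."""
    return {name: sum(w * dec[name] for w, dec in zip(weights, decoders))
            for name in decoders[0]}


# Toy data: one pre-training distribution and one incremental distribution.
pretrain = fit_diag_gaussian([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]])
incremental = fit_diag_gaussian([[5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# An input close to the incremental data leans on the incremental expert.
weights = mixing_weights([5.0, 5.0], [pretrain, incremental])
fused = interpolate_decoders(weights, [{"w": 1.0}, {"w": 3.0}])
```

In this toy setup the input lies in the incremental cluster, so the fused parameter ends up close to the incremental expert's value. An input far from every incremental distribution would instead fall back toward the pre-trained decoder.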

Paper Structure

This paper contains 43 sections, 1 equation, 7 figures, 28 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Performance of the OVS model (fc-clip), Retraining, Fine-tuning, and ConOVS compared against the closed-set segmentation model OneFormer. (b) Performance of the Baseline (fc-clip), Fine-tuning, Retraining, and ConOVS on the pre-training, incremental, and zero-shot test datasets. Performance is measured in PQ.
  • Figure 2: (a) Performance degradation on the pre-training and zero-shot datasets after fine-tuning, using fc-clip. (b) Performance of OneFormer, the baseline (fc-clip), retraining, fine-tuning, three existing continual learning methods (EWC, LwF, ECLIPSE), and ConOVS on the pre-training and incremental datasets. All methods are trained for the same number of iterations. Performance is measured in PQ.
  • Figure 3: Overview of the inference process of our proposed method.
  • Figure 4: Interpolation factor behavior across different input sample distributions.
  • Figure C1: Performance on the evaluation set of Cityscapes and COCO depending on the interpolation factor $\lambda$, using fc-clip.
  • ...and 2 more figures