kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

TL;DR

This work tackles open-vocabulary segmentation under continually expanding vocabularies, a setting in which fine-tuning causes catastrophic forgetting, and proposes a training-free retrieval framework. kNN-CLIP builds a dynamically growing embedding database and uses retrieval to augment CLIP-based dense predictors, which avoids retraining and achieves zero forgetting. The retrieved prediction $P_{ret}$ is fused with the base prediction to produce $P_{final}$, controlled by a confidence threshold $T$ and a weight $\lambda$, so the vocabulary can be expanded without modifying the model. Empirically, it achieves state-of-the-art open-vocabulary semantic and panoptic segmentation on large-vocabulary benchmarks, with gains such as +7.2 mIoU on ADE20K, while keeping memory and compute overhead low.
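The fusion step can be illustrated with a small sketch. This is one plausible reading, not the paper's exact formula: only the symbols $P_{ret}$, $P_{final}$, $T$, and $\lambda$ come from the summary above. The function name `fuse_predictions`, the gating rule (retrieval is used only when its top score reaches $T$), and the convex combination are assumptions made for illustration.

```python
def fuse_predictions(p_base, p_ret, threshold, lam):
    """Blend base and retrieval class scores for a single mask.

    p_base, p_ret: dicts mapping class name -> score in [0, 1].
    threshold (T): minimum retrieval confidence needed to use retrieval.
    lam (lambda): weight on the retrieval scores when they are used.
    All of this is a hypothetical reading of the fusion rule.
    """
    # Assumption: retrieval confidence is the largest retrieved class score.
    ret_conf = max(p_ret.values(), default=0.0)
    if ret_conf < threshold:
        # Retrieval is not confident enough: keep the base prediction.
        return dict(p_base)
    # Otherwise take a convex combination over the union of both vocabularies,
    # so classes known only to the database can still win.
    classes = set(p_base) | set(p_ret)
    return {
        c: lam * p_ret.get(c, 0.0) + (1.0 - lam) * p_base.get(c, 0.0)
        for c in classes
    }


p_base = {"cat": 0.6, "dog": 0.4}
p_final = fuse_predictions(p_base, {"lynx": 0.9, "cat": 0.1}, threshold=0.5, lam=0.5)
best_class = max(p_final, key=p_final.get)
```

In this toy call the class `lynx`, which is absent from the base vocabulary, becomes the top prediction because the retrieval branch is confident. With a retrieval score below the threshold, the base prediction is returned unchanged.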

Abstract

Continual segmentation has not yet tackled the challenge of improving open-vocabulary segmentation models with training data for accurate segmentation across large, continually expanding vocabularies. We discover that traditional continual training results in severe catastrophic forgetting, failing to outperform a zero-shot segmentation baseline. We introduce a novel training-free strategy, kNN-CLIP, which augments the model with a database of instance embeddings for semantic and panoptic segmentation that achieves zero forgetting. We demonstrate that kNN-CLIP can adapt to continually growing vocabularies without the need for retraining or large memory costs. kNN-CLIP enables open-vocabulary segmentation methods to expand their vocabularies on any domain with a single pass through the data, while only storing compact embeddings. This approach minimizes both compute and memory costs. kNN-CLIP achieves state-of-the-art performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a significant step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.

Paper Structure (16 sections, 4 equations, 4 figures, 7 tables)

This paper contains 16 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: We propose kNN-CLIP to continually expand the vocabulary space of segmentation models. Our approach supports concept customization and the identification of long-tailed concepts, a known challenge for CLIP models [udandarao2024no]. For concept customization, we efficiently build the supporting database for each long-tailed concept using C2C [prabhu2023categories]. We use EntitySeg [qi2022open] to generate class-agnostic masks for entities in the database and kNN-CLIP to label these masks. At inference time, we filter the masks from EntitySeg based on confidence thresholds for both the mask and class predictions.
  • Figure 2: Dynamic Database Construction. A key benefit of our methodology is that it allows for the seamless integration of embeddings for new classes into our database, continuously expanding its vocabulary space.
  • Figure 3: Retrieval Augmentation from the Database. By retrieving similar features from the database, we integrate the retrieved information with our previous prediction.
  • Figure 4: Augmenting FC-CLIP. We integrate the retrieval augmentation module into the state-of-the-art segmentation model, FC-CLIP. FC-CLIP includes an in-vocabulary branch and an out-of-vocabulary branch. For simplicity, the original out-of-vocabulary branch is not shown here. We use kNN-CLIP to augment the out-of-vocabulary branch using DINOv2 features and retrieved information.
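The database construction and retrieval steps described in Figures 2 and 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class `EmbeddingDatabase`, brute-force cosine search, and similarity-weighted voting are hypothetical choices, and the 2-D vectors stand in for the per-mask features (DINOv2 in the paper) that a real system would store.

```python
import math
from collections import defaultdict


def _normalize(vec):
    """Return vec scaled to unit length (cosine similarity needs this)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("embedding must be non-zero")
    return [x / norm for x in vec]


class EmbeddingDatabase:
    """Hypothetical growing store of (embedding, label) pairs."""

    def __init__(self):
        self.vectors = []
        self.labels = []

    def add(self, embedding, label):
        # Adding a class only appends rows; stored entries are never altered,
        # so retrieval for earlier classes is unaffected by later additions.
        self.vectors.append(_normalize(embedding))
        self.labels.append(label)

    def retrieve(self, query, k=3):
        """Return class scores from the k nearest stored embeddings."""
        q = _normalize(query)
        sims = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        votes = defaultdict(float)
        for i in top:
            # Assumption: each neighbour votes with its (non-negative) similarity.
            votes[self.labels[i]] += max(sims[i], 0.0)
        total = sum(votes.values())
        if total == 0.0:
            return {}
        return {label: score / total for label, score in votes.items()}


db = EmbeddingDatabase()
db.add([1.0, 0.0], "lynx")
db.add([0.9, 0.1], "lynx")
db.add([0.0, 1.0], "kite")
before = db.retrieve([1.0, 0.05], k=2)

# Expand the vocabulary later with a single append; no retraining step exists.
db.add([-1.0, 0.0], "okapi")
after = db.retrieve([1.0, 0.05], k=2)
new_class = db.retrieve([-1.0, 0.0], k=1)
```

Because expanding the vocabulary only appends entries, the scores for the earlier query are identical before and after the new class is added, which mirrors the zero-forgetting property claimed above. A real deployment would likely replace the linear scan with an approximate nearest-neighbour index.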