kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Zhongrui Gui; Shuyang Sun; Runjia Li; Jianhao Yuan; Zhaochong An; Karsten Roth; Ameya Prabhu; Philip Torr

TL;DR

This work tackles open-vocabulary segmentation under continually expanding vocabularies, where fine-tuning causes catastrophic forgetting, and proposes a training-free retrieval framework. kNN-CLIP builds a dynamically growing embedding database and uses retrieval to augment CLIP-based dense predictors, which avoids retraining and achieves zero forgetting. The retrieved prediction $P_{ret}$ is fused with the base prediction to produce $P_{final}$, governed by a confidence threshold $T$ and a weight $\lambda$, so the vocabulary can expand without modifying the model. Empirically, it achieves state-of-the-art open-vocabulary semantic and panoptic segmentation on large-vocabulary benchmarks, with gains such as +7.2 mIoU on ADE20K, while keeping memory and compute overhead low.

Abstract

Continual segmentation has not yet tackled the challenge of improving open-vocabulary segmentation models with training data for accurate segmentation across large, continually expanding vocabularies. We discover that traditional continual training results in severe catastrophic forgetting, failing to outperform a zero-shot segmentation baseline. We introduce a novel training-free strategy, kNN-CLIP, which augments the model with a database of instance embeddings for semantic and panoptic segmentation that achieves zero forgetting. We demonstrate that kNN-CLIP can adapt to continually growing vocabularies without the need for retraining or large memory costs. kNN-CLIP enables open-vocabulary segmentation methods to expand their vocabularies on any domain with a single pass through the data, while only storing compact embeddings. This approach minimizes both compute and memory costs. kNN-CLIP achieves state-of-the-art performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a significant step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.
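To make the idea of a continually expanding embedding database concrete, the following is a minimal sketch, not the authors' implementation. The class name `EmbeddingDatabase`, its methods, the toy 3-dimensional vectors, and the brute-force cosine search are all assumptions made for illustration; the source only states that compact instance embeddings are stored in a single pass and that new classes can be added without retraining.

```python
import math
from dataclasses import dataclass, field


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


@dataclass
class EmbeddingDatabase:
    """Hypothetical append-only store of (embedding, label) pairs."""

    embeddings: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    def add(self, embedding, label):
        # Expanding the vocabulary is an append; stored entries are never
        # modified, so earlier classes cannot be overwritten.
        self.embeddings.append(l2_normalize(embedding))
        self.labels.append(label)

    def vocabulary(self):
        return sorted(set(self.labels))

    def search(self, query, k):
        """Return the k most similar entries as (cosine_similarity, label)."""
        q = l2_normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, emb)), label)
            for emb, label in zip(self.embeddings, self.labels)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy embeddings (assumed values) standing in for instance features.
db = EmbeddingDatabase()
db.add([1.0, 0.0, 0.0], "cat")
db.add([0.9, 0.1, 0.0], "cat")
db.add([0.0, 1.0, 0.0], "dog")
# A new class arrives later: no retraining, just another append.
db.add([0.0, 0.0, 1.0], "zebra")
neighbours = db.search([0.0, 0.1, 1.0], k=2)
```

A real system would likely replace the brute-force scan with an approximate nearest-neighbour index, but the append-only behaviour shown here is the property that matters for avoiding forgetting.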

Paper Structure (16 sections, 4 equations, 4 figures, 7 tables)

Introduction
Related Work
kNN-CLIP: Continually Expanding Retrieval-Augmented Dense Prediction
Designing Continually Expanding Embedding Databases
Database Construction
Inference Using the Continually Expanding Embedding Databases
Augmenting FC-CLIP
Experiments
Implementation Details
Catastrophic Forgetting Restricts Open-Vocabulary Capabilities of Models
Comparison with Continual Segmentation Approaches
Retrieval Enhances Panoptic Segmentation
Retrieval Enhances Semantic Segmentation
Comparisons with Retrieval-based Approaches
Ablations
...and 1 more sections

Figures (4)

Figure 1: We propose kNN-CLIP to continually expand the vocabulary space of segmentation models. Our approach supports concept customization and the identification of long-tailed concepts, a known challenge for CLIP models [udandarao2024no]. For concept customization, we build the supporting database for each long-tailed concept efficiently using C2C [prabhu2023categories]. We use EntitySeg [qi2022open] to generate class-agnostic masks for entities in the database and kNN-CLIP to label these masks. At inference time, we filter the masks from EntitySeg based on confidence thresholds for both the mask and class predictions.
Figure 2: Dynamic Database Construction. A key benefit of our methodology is that it allows for the seamless integration of embeddings for new classes into our database, continuously expanding its vocabulary space.
Figure 3: Retrieval Augmentation from the Database. By retrieving similar features from the database, we integrate the retrieved information with our previous prediction.
Figure 4: Augmenting FC-CLIP. We integrate the retrieval augmentation module into the state-of-the-art segmentation model, FC-CLIP. FC-CLIP includes an in-vocabulary branch and an out-of-vocabulary branch. We do not show the original out-of-vocabulary branch here for simplicity. We use kNN-CLIP to augment the out-of-vocabulary branch using DINOv2 features and retrieved information.
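The retrieval augmentation step described in Figure 3, together with the symbols $P_{ret}$, $P_{final}$, $T$ and $\lambda$ from the TL;DR, can be illustrated with a small sketch. The exact fusion rule is not given in this page, so everything below is an assumption: the similarity-weighted vote that forms $P_{ret}$, the gating on the best neighbour similarity against $T$, and the convex combination $P_{final} = \lambda P_{ret} + (1 - \lambda) P_{base}$. The function names `retrieval_distribution` and `fuse` are hypothetical.

```python
def retrieval_distribution(neighbours, vocabulary):
    """Turn (similarity, label) neighbours into a normalized vote over classes."""
    votes = {name: 0.0 for name in vocabulary}
    for similarity, label in neighbours:
        # Assumed weighting: negative similarities contribute nothing.
        votes[label] += max(similarity, 0.0)
    total = sum(votes.values())
    if total == 0.0:
        return votes
    return {name: value / total for name, value in votes.items()}


def fuse(p_base, neighbours, threshold, lam):
    """Blend base and retrieved class scores (assumed rule, see lead-in)."""
    vocabulary = sorted(set(p_base) | {label for _, label in neighbours})
    best_similarity = max((s for s, _ in neighbours), default=0.0)
    if best_similarity < threshold:
        # Retrieval is not confident enough: keep the base prediction.
        return {name: p_base.get(name, 0.0) for name in vocabulary}
    p_ret = retrieval_distribution(neighbours, vocabulary)
    return {
        name: lam * p_ret[name] + (1.0 - lam) * p_base.get(name, 0.0)
        for name in vocabulary
    }


# The base model only knows "cat" and "dog"; the database also holds "zebra".
p_base = {"cat": 0.7, "dog": 0.3}
neighbours = [(0.9, "zebra"), (0.6, "zebra"), (0.3, "dog")]

p_final = fuse(p_base, neighbours, threshold=0.5, lam=0.5)
p_fallback = fuse(p_base, neighbours, threshold=0.95, lam=0.5)
```

In this toy setup, a class missing from the base vocabulary can still win once retrieval is confident, while a stricter threshold leaves the base prediction unchanged. The threshold and weight values here are arbitrary and not taken from the paper.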
