kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies
Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr
TL;DR
This work addresses open-vocabulary segmentation over continually expanding vocabularies, where fine-tuning causes catastrophic forgetting, and proposes a training-free retrieval framework. kNN-CLIP builds a dynamically growing embedding database and uses retrieval to augment CLIP-based dense predictors, avoiding retraining and achieving zero forgetting. Retrieval produces a prediction $P_{ret}$, which is fused with the base prediction into $P_{final}$ under a confidence threshold $T$ and a fusion weight $\lambda$, so new classes can be added without modifying the model. Empirically, it achieves state-of-the-art open-vocabulary semantic and panoptic segmentation on large-vocabulary benchmarks, with gains such as +7.2 mIoU on ADE20K, while keeping memory and compute overhead low.
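The retrieve-then-fuse idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `EmbeddingDatabase`, the function `fuse`, the similarity-weighted vote, the use of the top cosine similarity as the confidence, and the linear blend $P_{final} = \lambda P_{ret} + (1 - \lambda) P_{base}$ are all assumptions made for illustration, since this summary does not spell out the exact fusion rule. Pure Python is used to keep the example self-contained; a real system would likely use a vector-search library.

```python
import math
from collections import defaultdict


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


class EmbeddingDatabase:
    """Hypothetical growing store of (embedding, label) pairs."""

    def __init__(self):
        self.embeddings = []
        self.labels = []

    def add(self, embedding, label):
        # Expanding the vocabulary is an append: nothing is retrained
        # and existing entries are never modified.
        self.embeddings.append(_normalize(embedding))
        self.labels.append(label)

    def retrieve(self, query, k=3):
        """Return (p_ret, confidence) from the k nearest stored embeddings."""
        q = _normalize(query)
        sims = [
            (sum(a * b for a, b in zip(q, e)), label)
            for e, label in zip(self.embeddings, self.labels)
        ]
        top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
        if not top:
            return {}, 0.0
        # Assumption: similarity-weighted vote over the neighbours' labels.
        votes = defaultdict(float)
        for sim, label in top:
            votes[label] += max(sim, 0.0)
        total = sum(votes.values())
        p_ret = {lab: v / total for lab, v in votes.items()} if total > 0 else {}
        # Assumption: the best similarity serves as the retrieval confidence.
        return p_ret, top[0][0]


def fuse(p_base, p_ret, confidence, T=0.7, lam=0.5):
    """Blend base and retrieved predictions when retrieval is confident enough."""
    if confidence < T or not p_ret:
        return dict(p_base)  # fall back to the base predictor unchanged
    labels = set(p_base) | set(p_ret)
    return {
        lab: lam * p_ret.get(lab, 0.0) + (1 - lam) * p_base.get(lab, 0.0)
        for lab in labels
    }


db = EmbeddingDatabase()
db.add([1.0, 0.0], "cat")
db.add([0.0, 1.0], "dog")

p_base = {"cat": 0.4, "dog": 0.6}            # base predictor's guess for one mask
p_ret, conf = db.retrieve([0.9, 0.1], k=1)   # nearest stored embedding is "cat"
p_final = fuse(p_base, p_ret, conf, T=0.7, lam=0.5)

# A new class is added later without touching earlier entries.
db.add([1.0, 1.0], "zebra")
```

In this toy setup the threshold $T$ decides whether retrieval is trusted at all, and $\lambda$ sets how strongly it overrides the base prediction. Because `add` only appends, earlier classes remain retrievable after the vocabulary grows, which is the intuition behind zero forgetting.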
Abstract
Continual segmentation has not yet tackled the challenge of improving open-vocabulary segmentation models with training data for accurate segmentation across large, continually expanding vocabularies. We discover that traditional continual training results in severe catastrophic forgetting, failing to outperform a zero-shot segmentation baseline. We introduce a novel training-free strategy, kNN-CLIP, which augments the model with a database of instance embeddings for semantic and panoptic segmentation that achieves zero forgetting. We demonstrate that kNN-CLIP can adapt to continually growing vocabularies without the need for retraining or large memory costs. kNN-CLIP enables open-vocabulary segmentation methods to expand their vocabularies on any domain with a single pass through the data, while only storing compact embeddings. This approach minimizes both compute and memory costs. kNN-CLIP achieves state-of-the-art performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a significant step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.
