ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora

Nikolas Adaloglou, Diana Petrusheva, Mohamed Asker, Felix Michels, Markus Kollmann

TL;DR

This paper tackles unsupervised visual OOD detection by eliminating reliance on predefined in-distribution label names. It introduces ClusterMine, a cluster-based positive label mining method that derives ID-related concepts from a large text corpus and enforces visual consistency via TEMI clustering, mapping each cluster to a label name by majority voting. Operating without ground-truth ID labels, ClusterMine achieves state-of-the-art AUROC across multiple CLIP models and OOD benchmarks, and remains robust under covariate shifts and near-OOD conditions. The work also analyzes label quality and reports ablations, demonstrating that cluster-based positive mining can outperform both traditional negative-label mining and methods that depend on ground-truth (GT) labels, with practical implications for scalable, unsupervised OOD detection in vision-language systems.

Abstract

Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining under a general concept mining paradigm, bypassing the need for ground-truth positives. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency. Our experimental study reveals that ClusterMine scales across a wide range of CLIP models and achieves state-of-the-art robustness to covariate in-distribution shifts. The code is available at https://github.com/HHU-MMBS/clustermine_wacv_official.
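
The combination of clustering and zero-shot image-text consistency described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mine_positive_labels`, its arguments, and the use of plain cosine similarity for the zero-shot step are assumptions made for illustration, and the cluster assignments are taken as given (the paper obtains them with TEMI clustering). Each image votes for its top-1 corpus concept, and each cluster keeps the concept with the most votes.

```python
import numpy as np

def mine_positive_labels(image_feats, text_feats, cluster_ids, corpus_names):
    """Hypothetical sketch: pick one corpus name per cluster by majority vote.

    image_feats:  (N, D) image embeddings (e.g. from a CLIP image encoder)
    text_feats:   (M, D) embeddings of the M corpus concepts
    cluster_ids:  (N,) precomputed integer cluster assignment per image
    corpus_names: list of M concept names aligned with text_feats
    """
    # L2-normalise so that the dot product equals cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Zero-shot step: index of the most similar corpus concept per image.
    zero_shot = (img @ txt.T).argmax(axis=1)

    mined = {}
    for c in np.unique(cluster_ids):
        votes = zero_shot[cluster_ids == c]
        # Majority vote: the concept predicted most often inside the cluster.
        winner = np.bincount(votes, minlength=len(corpus_names)).argmax()
        mined[int(c)] = corpus_names[winner]
    return mined
```

In this sketch a single noisy zero-shot prediction inside an otherwise consistent cluster is outvoted, which is one way to read the visual-consistency idea; ties are broken toward the lowest corpus index, an arbitrary choice.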

Paper Structure

This paper contains 21 sections, 2 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: An overview of the label mining framework for OOD detection using CLIP. Given a text corpus $\mathcal{Y}_{\text{corpus}}$ and its feature representation $\mathcal{Z}_{\text{corpus}}$, ClusterMine and PosMine aim to extract in-distribution-related class names $\mathcal{Y}_{\text{pos}}$ in the shared vision-language space of CLIP. $\mathcal{Z}_{\text{neg}}$ can be either realized as the non-overlapping elements of $\mathcal{Y}_{\text{pos}}$ and $\mathcal{Y}_{\text{corpus}}$, or as the most dissimilar text representations from $\mathcal{Z}_{\text{pos}}$ (negative label mining). The OOD detection score is $S(x)$. Best viewed in color.
  • Figure 2: A visual illustration of ClusterMine.
  • Figure 3: Scalable out-of-distribution detection AUROC (%, y-axis) using various pretrained CLIP weights (x-axis). Unlike previous state-of-the-art methods that require $\mathcal{Y}_{\text{GT}}$ (MCM, NegLabel), ClusterMine and PosMine extract the in-distribution-related label names from a text corpus. Mean AUROC is computed across six OOD datasets, using ImageNet-1K as ID. We use the WordNet (Miller, 1995) corpus. Pretrained CLIP models are sorted based on their performance with respect to ClusterMine.
  • Figure 4: OOD detection robustness to multiple ID shifts (x-axis) compared to ImageNet using CLIP ViT-H DFN-5B (Fang et al., 2023). The relative AUROC difference in % of each method compared to its ImageNet score is shown on top of each bar. We report the mean AUROC ($\uparrow$, %) across six different OOD datasets (y-axis).
  • Figure 5: Analysis of mined label name quality. We measure quality by the top-1 text-text similarity to the GT label names (left) and by the shortest path (minimum number of hops) to a GT label name in WordNet (right).
  • ...and 6 more figures
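
The Figure 1 caption mentions two ways to obtain negatives once positives have been mined: take every corpus entry that was not mined as a positive, or keep only the entries most dissimilar to the positives. The sketch below illustrates both options under stated assumptions. The function name `select_negatives`, the `num_neg` parameter, and the dissimilarity criterion (lowest maximum cosine similarity to any positive) are hypothetical; the exact criterion used in the paper is not given in this summary.

```python
import numpy as np

def select_negatives(text_feats, pos_idx, num_neg=None):
    """Hypothetical sketch of the two negative-selection options.

    text_feats: (M, D) embeddings of all corpus concepts
    pos_idx:    indices of the concepts mined as positives
    num_neg:    if None, return all non-positive indices (option 1);
                otherwise return the num_neg least similar ones (option 2)
    """
    pos_idx = np.unique(np.asarray(pos_idx))
    # Option 1: corpus entries that do not overlap with the positives.
    candidates = np.setdiff1d(np.arange(len(text_feats)), pos_idx)
    if num_neg is None:
        return candidates

    # Option 2: rank candidates by their closest positive (assumed criterion).
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    closeness = (txt[candidates] @ txt[pos_idx].T).max(axis=1)
    order = np.argsort(closeness, kind="stable")  # least similar first
    return candidates[order[:num_neg]]
```

Option 1 needs no hyperparameter, while option 2 trades the size of the negative set against how far the negatives sit from the mined positives in the text embedding space.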