BloomCoreset: Fast Coreset Sampling using Bloom Filters for Fine-Grained Self-Supervised Learning

Prajwal Singh, Gautam Vashishtha, Indra Deep Mastan, Shanmuganathan Raman

TL;DR

Open-Set SSL for fine-grained recognition suffers from expensive coreset sampling from large unlabeled pools. BloomCoreset proposes a Counting Bloom Filter-based pipeline that indexes Open-Set and domain-specific Open-CLIP features to quickly retrieve semantically aligned samples, followed by top-k filtering to reduce false positives. It integrates with the SimCore framework to train a fine-grained SSL model, and experiments across 11 downstream datasets show substantial speedups with only a small average accuracy trade-off. The approach enables scalable, domain-specific SSL with limited labeled data and demonstrates robust cross-dataset performance.

Abstract

The success of deep learning in supervised fine-grained recognition for domain-specific tasks relies heavily on expert annotations. The Open-Set for fine-grained Self-Supervised Learning (SSL) problem aims to enhance performance on downstream tasks by strategically sampling a subset of images (the Core-Set) from a large pool of unlabeled data (the Open-Set). In this paper, we propose a novel method, BloomCoreset, that significantly reduces sampling time from the Open-Set while preserving the quality of samples in the coreset. To achieve this, we utilize Bloom filters as a hashing mechanism to store both low- and high-level features of the fine-grained dataset, as captured by Open-CLIP, in a space-efficient manner that enables rapid retrieval of the coreset from the Open-Set. To show the effectiveness of the sampled coreset, we integrate the proposed method into the state-of-the-art fine-grained SSL framework, SimCore [1]. Compared with the baseline sampling strategy in SimCore [1], the proposed algorithm reduces sampling time by $98.5\%$ with an average accuracy trade-off of only $0.83\%$ across $11$ downstream datasets.

Paper Structure

This paper contains 10 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Fine-Grained SSL Framework. The figure illustrates the workflow for addressing a fine-grained SSL problem. We use the pre-trained OpenCLIP model [cherti2023reproducible] to extract image features (hash codes) from domain-specific and open-set image pools. These features are then processed through the BloomCoreset algorithm, which generates a coreset from the open-set image pool. The coreset and domain-specific data are subsequently used to train the self-supervised method.
  • Figure 2: BloomCoreset. Domain-specific features are used to build the Counting Bloom Filter (CBF). A membership test is then conducted to sample Open-Set images similar to domain-specific ones. After sampling, the inner product is calculated, and additional filtering is applied to select the best subset from the sampled Open-Set images. A detailed overview of the sampling method is provided in Section \ref{subsec:coresetexpalin}.
  • Figure 3: Other Open-Sets. While comparing across different Open-Sets, the learned representation from the coreset sampled using the proposed method shows competitive performance to the baseline and, in some cases, outperforms the baseline.
  • Figure 4: Feature Distribution. The figure shows the representation space of Open-CLIP [cherti2023reproducible]. We use Gaussian kernel density estimation [wang2020understanding] to show the feature distribution of downstream dataset samples and coreset samples across the unit ring.
  • Figure : Sampling image subset (coreset) from the open-set images
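
The sampling steps named in the Figure 2 caption (build a Counting Bloom Filter from domain-specific features, run a membership test over Open-Set features, then rank the survivors by inner product and keep the top k) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The sign-based `hash_code` binarization, the filter size, the number of hash functions, the max-inner-product score, and all function names are assumptions made for this example; the captions do not specify them.

```python
import hashlib
import heapq


class CountingBloomFilter:
    """Counting Bloom filter: an array of counters instead of single bits."""

    def __init__(self, num_counters=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _indexes(self, key):
        # Derive one index per hash function by salting SHA-256 with its number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, key):
        for idx in self._indexes(key):
            self.counters[idx] += 1

    def __contains__(self, key):
        # False positives are possible; false negatives are not.
        return all(self.counters[idx] > 0 for idx in self._indexes(key))


def hash_code(feature):
    # Assumption: binarize each feature dimension by its sign.
    return bytes(1 if x > 0 else 0 for x in feature)


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_coreset(domain_feats, open_feats, top_k):
    """Return indexes of Open-Set features kept after both filtering stages."""
    cbf = CountingBloomFilter()
    for feat in domain_feats:
        cbf.add(hash_code(feat))

    # Stage 1: membership test keeps Open-Set items whose code is in the filter.
    candidates = [i for i, f in enumerate(open_feats) if hash_code(f) in cbf]

    # Stage 2: score each candidate by its best inner product with any
    # domain feature, then keep the top_k highest-scoring candidates.
    scored = [
        (max(inner(open_feats[i], d) for d in domain_feats), i)
        for i in candidates
    ]
    return [i for _, i in heapq.nlargest(top_k, scored)]


# Toy 3-dimensional features (made up for illustration).
domain = [[0.9, 0.1, -0.4], [-0.2, 0.8, 0.5]]
open_set = [
    [0.8, 0.2, -0.5],     # same sign pattern as domain[0], high similarity
    [-0.7, -0.6, -0.1],   # sign pattern not present in the domain set
    [-0.1, 0.9, 0.3],     # same sign pattern as domain[1], high similarity
    [0.1, 0.05, -0.02],   # passes the membership test but scores low
]
print(sample_coreset(domain, open_set, top_k=2))  # [0, 2]
```

In this toy run the fourth Open-Set item passes the membership test but is dropped by the inner-product ranking, which is the role the second filtering stage plays in reducing loosely matched samples.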