Retrieval-enriched zero-shot image classification in low-resource domains

Nicola Dall'Asen, Yiming Wang, Enrico Fini, Elisa Ricci

TL;DR

CoRE (Combination of Retrieval Enrichment) enriches the representations of both query images and class prototypes with relevant text retrieved from large web-crawled databases. The broader class-relevant context this adds significantly boosts zero-shot classification performance in low-resource domains.

Abstract

Low-resource domains, characterized by scarce data and annotations, present significant challenges for language and visual understanding tasks, with the latter much under-explored in the literature. Recent advancements in Vision-Language Models (VLMs) have shown promising results in high-resource domains but fall short on low-resource concepts that are under-represented (e.g., only a handful of images per category) in the pre-training set. We tackle the challenging task of zero-shot low-resource image classification from a novel perspective. By leveraging a retrieval-based strategy, we achieve this in a training-free fashion. Specifically, our method, named CoRE (Combination of Retrieval Enrichment), enriches the representation of both query images and class prototypes by retrieving relevant textual information from large web-crawled databases. This retrieval-based enrichment significantly boosts classification performance by incorporating the broader contextual information relevant to the specific class. We validate our method on a newly established benchmark covering diverse low-resource domains, including medical imaging, rare plants, and circuits. Our experiments demonstrate that CoRE outperforms existing state-of-the-art methods that rely on synthetic data generation and model fine-tuning.

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Our retrieval-based solution enriches both images and textual descriptors with real-world captions that mention domains and classes. Even when the captions are generic (third row for each example), they can still restrict the focus to the correct domain.
  • Figure 2: CoRE enriches both the image embedding $z_q$ and the class prompts $p$ with captions retrieved from a large-scale web-crawled database $\mathbb{D}$. We weight the retrieved captions $\mathcal{T}$ by their similarity scores $\mathcal{S}^T$, which we skew with controllable temperatures $\tau_{i2t}$ and $\tau_{t2t}$. By combining the retrieved caption embeddings with the original representations $W$ and $z_q$ through $\alpha$ and $\beta$, we obtain enriched representations $W^+$ and $z_q^+$, which we employ for zero-shot classification.
  • Figure 3: Top-1 accuracy of CoRE CC12M on Circuits with varying $\alpha$ and $\beta$. CoRE achieves the best performance with a balanced merge of image-retrieved captions ($\beta \approx 0.5$), while for class-relevant captions the best weighting is slightly lower ($\alpha \approx 0.2$).
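
The enrichment step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The captions only state that retrieved captions are weighted by temperature-skewed similarity scores and merged with the original representations through $\alpha$ and $\beta$, so the sketch makes three assumptions: the weighting is a temperature-scaled softmax, the merge is a convex combination, and both results are L2-normalized. The function names, the toy 2-D vectors, and the temperature value 0.1 are all hypothetical. Plain Python lists stand in for real embedding tensors. The values $\alpha = 0.2$ and $\beta = 0.5$ are taken from the Figure 3 caption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores, temperature):
    """Temperature-scaled softmax (assumed form of the 'skewed' weighting)."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def enrich(original, caption_embs, sims, temperature, weight):
    """Blend `original` with a similarity-weighted average of caption embeddings.

    Assumption: convex combination (1 - weight) * original + weight * pooled.
    """
    w = softmax(sims, temperature)
    pooled = [
        sum(wk * emb[d] for wk, emb in zip(w, caption_embs))
        for d in range(len(original))
    ]
    pooled = l2_normalize(pooled)
    mixed = [(1 - weight) * o + weight * p for o, p in zip(original, pooled)]
    return l2_normalize(mixed)


def classify(z, prototypes):
    """Return the index of the prototype with the highest dot product."""
    scores = [sum(a * b for a, b in zip(z, w)) for w in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (hypothetical): two classes, 2-D unit-norm embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]               # original class prototypes
z_q = l2_normalize([0.9, 1.0])             # query image embedding
image_captions = [[1.0, 0.0], [0.8, 0.6]]  # captions retrieved for the image
image_sims = [0.9, 0.8]                    # their image-to-text similarities
class_captions = [[[1.0, 0.0]], [[0.0, 1.0]]]  # captions retrieved per class
class_sims = [[0.9], [0.9]]                # their text-to-text similarities

alpha, beta = 0.2, 0.5         # merge weights, values from the Figure 3 caption
tau_i2t, tau_t2t = 0.1, 0.1    # hypothetical temperatures

z_q_plus = enrich(z_q, image_captions, image_sims, tau_i2t, beta)
W_plus = [
    enrich(w, caps, sims, tau_t2t, alpha)
    for w, caps, sims in zip(W, class_captions, class_sims)
]

baseline = classify(z_q, W)            # prediction without enrichment
enriched = classify(z_q_plus, W_plus)  # prediction with enrichment
```

In this toy setup the retrieved captions pull the query embedding toward the first class, so `baseline` and `enriched` differ. In a real pipeline the caption embeddings would come from a text encoder and a nearest-neighbour search over the caption database, which the sketch leaves out.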