Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy

TL;DR

This work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources.

Abstract

Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
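
The probing idea in the abstract (scoring how well the VLFM separates the correct description from LLM-generated hard negatives) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `probe_features`, the three feature names, the temperature value, and the toy 3-dimensional vectors, which stand in for real image and caption embeddings. The features actually engineered by the authors are defined in the paper and the released code, not here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def probe_features(image_emb, caption_emb, counterfactual_embs, temperature=0.01):
    """Hypothetical per-image features: how clearly is the correct caption
    preferred over the counterfactual (hard negative) captions?"""
    s_pos = cosine(image_emb, caption_emb)
    s_neg = [cosine(image_emb, e) for e in counterfactual_embs]

    # Margin between the correct caption and the hardest counterfactual.
    margin = s_pos - max(s_neg)

    # Softmax probability of the correct caption among all candidates,
    # computed with the usual max-subtraction for numerical stability.
    logits = [s / temperature for s in [s_pos] + s_neg]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    p_correct = exps[0] / sum(exps)

    return {"sim_correct": s_pos, "margin": margin, "p_correct": p_correct}


# Toy embeddings (NOT real VLFM outputs), chosen so the correct caption wins.
image = [1.0, 0.0, 0.0]
correct_caption = [0.9, 0.1, 0.0]
counterfactuals = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]

features = probe_features(image, correct_caption, counterfactuals)
print(features)
```

A positive `margin` means the model ranks the correct caption above every counterfactual. In a real setting the vectors would come from the image and text encoders of the VLFM under evaluation.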

Paper Structure (18 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Methodological Overview. This figure illustrates our three-stage pipeline for predicting a Vision-Language Model's zero-shot accuracy on a target domain, using the "Ekwang" class from the African Food dataset as an example. (1) Counterfactual Probing: a single representative image (Step 1) is used to generate a plausible caption $T_{pc}$ (Step 2) and a set of counterfactual captions $\{T_{cf_i}\}$ via an LLM (Steps 3-4). (2) Similarity Scoring: the VLFM under evaluation computes embeddings for the image ($\mathcal{I}$) and the captions (Step 5). $\{T_{L_i}\}$ denotes the set of standard CLIP zero-shot text prompts, one per class $i$, e.g. "a photo of Ekwang". Two sets of similarity scores are calculated: one for the standard CLIP zero-shot prompts and another for the LLM-generated captions (Step 6). (3) Performance Prediction: the similarity scores are used as input to a Ridge Regression Model (Step 7), which is trained to estimate the VLFM's zero-shot accuracy on the full test set using only one labelled image per class.
  • Figure 2: Ground-Truth vs. Predicted Zero-Shot Accuracy. The scatter plot depicts our method's predictions across 16 datasets. The x-axis shows the actual zero-shot accuracy of OpenCLIP-ViT-B/16, calculated on each dataset's full test set, and the y-axis shows our predicted accuracy using only a single labelled image per class. The dashed black line indicates perfect agreement. Blue points are datasets used to train our Ridge Regression Model; red points are unseen test datasets. The strong test correlation demonstrates our method's accuracy and generalisation, even in underrepresented domains such as African Food and Beans.
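
The final stage described in the captions above (a ridge regressor mapping similarity-based features to zero-shot accuracy, evaluated by correlation with the ground truth) can be sketched as follows. This is a deliberately reduced illustration, not the authors' model: it assumes a single scalar feature per dataset, whereas the paper feeds several similarity scores to the regressor. The helper names `fit_ridge_1d` and `pearson_r`, the regularisation strength, and all numbers in the toy data are made up for the example and are not results from the paper.

```python
import math


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Closed-form ridge regression for one feature with an unpenalised
    intercept: slope = Sxy / (Sxx + alpha), computed on centred data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic toy data (NOT from the paper): one value per hypothetical dataset.
margins = [0.05, 0.10, 0.20, 0.30]       # e.g. mean caption-vs-counterfactual margin
accuracies = [0.30, 0.45, 0.70, 0.90]    # made-up ground-truth zero-shot accuracy

slope, intercept = fit_ridge_1d(margins, accuracies, alpha=0.01)
predicted = [slope * m + intercept for m in margins]

print("slope:", slope, "intercept:", intercept)
print("Pearson r (predicted vs. actual):", pearson_r(predicted, accuracies))
```

The `alpha` term shrinks the slope towards zero relative to ordinary least squares, which is the usual motivation for ridge regression when only a handful of training datasets is available. A faithful evaluation would compute the correlation on held-out datasets, as the red points in Figure 2 do, rather than on the training points used in this sketch.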