Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters

Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy

TL;DR

Hold-One-Shot-Out (HOSO) lets CLIP-Adapter-style methods compete in the recently established validation-free setting: under the validation-free few-shot protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets.

Abstract

In many CLIP adaptation methods, a blending ratio hyperparameter controls the trade-off between general pretrained CLIP knowledge and the limited, dataset-specific supervision from the few-shot examples. Most few-shot CLIP adaptation techniques either report results obtained by ablating the blending ratio on the test set or require additional validation sets to select it per dataset, and are therefore not strictly few-shot. We present a simple, validation-free method for learning the blending ratio in CLIP adaptation. Hold-One-Shot-Out (HOSO) is a novel approach that lets CLIP-Adapter-style methods compete in the recently established validation-free setting. CLIP-Adapter with HOSO (HOSO-Adapter) learns the blending ratio on a one-shot hold-out set, while the adapter trains on the remaining few-shot support examples. Under the validation-free few-shot protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets. Interestingly, in the 8- and 16-shot settings, HOSO-Adapter outperforms CLIP-Adapter even when the latter uses the optimal blending ratio selected on the test set. Ablation studies validate the one-shot hold-out mechanism and decoupled training, and show improvements over a naively learnt blending ratio baseline. Code is released here: https://github.com/chris-vorster/HOSO-Adapter
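The data split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the helper name `hold_one_shot_out`, the random choice of which shot to hold out, and holding out exactly one example per class are assumptions based on the phrase "a one-shot hold-out set". The adapter would then train on the support indices, and the blending ratio would be learnt on the held-out indices.

```python
import random
from collections import defaultdict


def hold_one_shot_out(labels, seed=0):
    """Split a K-shot support set into a (K-1)-shot support set and a 1-shot hold-out set.

    Hypothetical helper: holds out one randomly chosen example per class.
    Returns (support_indices, holdout_indices), both sorted.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # seeded so the split is reproducible
    support, holdout = [], []
    for label, indices in by_class.items():
        if len(indices) < 2:
            raise ValueError(f"class {label!r} needs at least 2 shots to hold one out")
        held = rng.choice(indices)
        holdout.append(held)
        support.extend(i for i in indices if i != held)
    return sorted(support), sorted(holdout)


if __name__ == "__main__":
    # 3 classes x 4 shots: the adapter would see 3 shots per class, the blending ratio 1 per class.
    labels = [c for c in ("cat", "dog", "car") for _ in range(4)]
    support, holdout = hold_one_shot_out(labels, seed=42)
    print(len(support), len(holdout))  # 9 3
```

Because one shot per class is removed, this split needs at least two shots per class, which is consistent with Figure 3 reporting results for K=2, 4, 8, 16.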

Paper Structure (33 sections, 5 equations, 10 figures, 15 tables)

Figures (10)

  • Figure 1: CLIP's 1-shot accuracy and its full test-set accuracy are strongly correlated across multiple runs. Inspired by this, we use a single hold-out shot to learn the blending ratio for CLIP adapters.
  • Figure 2: The optimal blending ratio reflects a dataset-specific trade-off between the CLIP prior and task adaptation. Fine-grained datasets like Stanford Cars benefit from a higher $\alpha$ to learn new features, while the general-domain ImageNet performs better with a lower $\alpha$ that preserves the strong prior. This variability makes any fixed ratio suboptimal, motivating our adaptive, validation-free approach.
  • Figure 3: HOSO-Adapter performance across few-shot settings (K=2, 4, 8, 16) on the ResNet-50 backbone, with results averaged over 11 datasets. Our method achieves state-of-the-art performance at K=8 and K=16, surpassing even the CLIP-Adapter Oracle baseline. The Oracle baseline, which uses a blending ratio grid-searched on the test set for each dataset, is included for reference and is not directly comparable in the validation-free setting. CLIP-Adapter's results are the validation-free values reported by Silva-Rodríguez et al. (2024).
  • Figure 4: Three example datasets showing a consistent trend: hold-one-shot-out yields a lower blending ratio (green) than naively learning the blending ratio on the few-shot examples themselves (red), which enables overfitting to those limited examples. See Suppl. Section E for the full table.
  • Figure 5: Three example datasets showing a consistent trend: hold-one-shot-out (green) overfits less than the naively learnt blending ratio (red). See Suppl. Section F for the full table.
  • ...and 5 more figures
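The blending ratio itself can be illustrated with a small, dependency-free sketch. It assumes the residual form used by CLIP-Adapter, in which the adapted image feature is α·A(f) + (1 − α)·f for a frozen CLIP feature f and adapter output A(f); a higher α leans on the adapter (as Figure 2 describes for Stanford Cars) and a lower α preserves the CLIP prior. Only α is fitted, on held-out shots, by minimising cross-entropy with α = sigmoid(θ) to keep it in (0, 1), while the adapter outputs stay fixed (an assumed reading of "decoupled training"). The helper names, toy 2-D features, logit scale of 10, and the finite-difference gradient standing in for autograd are all hypothetical choices for illustration, not the HOSO-Adapter implementation.

```python
import math


def normalize(v):
    """L2-normalise a vector (features are compared by cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def blend(f, a, alpha):
    """Residual blend of frozen feature f and adapter output a: alpha * a + (1 - alpha) * f."""
    return [alpha * ai + (1.0 - alpha) * fi for fi, ai in zip(f, a)]


def class_logits(feature, text_weights, scale):
    """Scaled cosine logits against pre-normalised class text embeddings."""
    z = normalize(feature)
    return [scale * sum(zi * wi for zi, wi in zip(z, w)) for w in text_weights]


def cross_entropy(scores, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def holdout_loss(theta, holdout, text_weights, scale):
    """Mean cross-entropy on the held-out shots for alpha = sigmoid(theta)."""
    alpha = sigmoid(theta)
    losses = [cross_entropy(class_logits(blend(f, a, alpha), text_weights, scale), y)
              for f, a, y in holdout]
    return sum(losses) / len(losses)


def learn_alpha(holdout, text_weights, scale=10.0, lr=0.05, steps=200, eps=1e-4):
    """Gradient descent on theta only; f and A(f) for the held-out shots stay fixed.

    The gradient is a central finite difference, a stand-in for autograd.
    """
    theta = 0.0  # start at alpha = 0.5
    for _ in range(steps):
        grad = (holdout_loss(theta + eps, holdout, text_weights, scale)
                - holdout_loss(theta - eps, holdout, text_weights, scale)) / (2 * eps)
        theta -= lr * grad
    return sigmoid(theta)


if __name__ == "__main__":
    text_weights = [[1.0, 0.0], [0.0, 1.0]]  # toy, already L2-normalised class embeddings
    # Each held-out shot: (frozen CLIP feature f, adapter output A(f), true class index).
    adapter_right = ([0.0, 1.0], [1.0, 0.0], 0)  # CLIP prior wrong, adapter right
    clip_right = ([1.0, 0.0], [0.0, 1.0], 0)     # CLIP prior right, adapter wrong
    alpha = learn_alpha([adapter_right, adapter_right, clip_right], text_weights)
    print(f"learnt alpha = {alpha:.3f}")
```

With two held-out shots where the adapter is right and the CLIP prior is wrong, plus one where the reverse holds, the learnt α settles slightly above 0.5; swapping those proportions pushes it below 0.5. If α were instead fitted on the adapter's own training shots, the adapter-right case would tend to dominate and push α upward, which matches the higher ratios Figure 4 reports for the naively learnt baseline.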