Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

TL;DR

This work tackles out-of-distribution few-shot learning with CLIP-like models by addressing the parameter efficiency versus accuracy tradeoff. It introduces descriptor soup, which greedily selects GPT-derived descriptors, and word soup, which greedily builds chains from a 10k-word pool, both aiming to maximize few-shot training accuracy without requiring test-time LLMs. A diversity loss is proposed to preserve descriptor diversity during finetuning, enabling compatibility with existing few-shot methods. Across cross-dataset and domain-generalization benchmarks, the soups achieve state-of-the-art or competitive accuracy with far fewer trainable parameters and reduced memory, offering a practical approach to robust OOD generalization.

Abstract

Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are untrainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please check out our code: github.com/Chris210634/word_soups
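The greedy selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes descriptors are ranked by their individual few-shot accuracy, that a descriptor joins the soup only if the ensemble accuracy improves, and that class embeddings are the normalized mean over the selected descriptors. The function names, the synthetic embeddings, and the parameter `m` (maximum soup size) are hypothetical stand-ins for real CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def soup_accuracy(text_emb, selected, image_feats, labels):
    """Accuracy of the ensemble formed by averaging the selected descriptors.

    text_emb:    (n_descriptors, n_classes, dim) text embedding per descriptor and class
    image_feats: (n_images, dim) normalized image features
    labels:      (n_images,) integer class labels
    """
    class_emb = l2_normalize(text_emb[selected].mean(axis=0))  # (n_classes, dim)
    preds = (image_feats @ class_emb.T).argmax(axis=1)
    return float((preds == labels).mean())

def greedy_descriptor_soup(text_emb, image_feats, labels, m):
    """Assumed greedy rule: visit descriptors from best to worst individual
    accuracy and keep one only if it raises the ensemble's training accuracy."""
    n_desc = text_emb.shape[0]
    solo = [soup_accuracy(text_emb, [i], image_feats, labels) for i in range(n_desc)]
    order = sorted(range(n_desc), key=lambda i: solo[i], reverse=True)
    selected, best = [order[0]], solo[order[0]]
    for i in order[1:]:
        if len(selected) >= m:
            break
        acc = soup_accuracy(text_emb, selected + [i], image_feats, labels)
        if acc > best:
            selected.append(i)
            best = acc
    return selected, best

# Synthetic stand-in for CLIP embeddings (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n_desc, n_cls, dim, n_img = 20, 5, 16, 100
prototypes = rng.normal(size=(n_cls, dim))
text_emb = l2_normalize(prototypes[None] + 0.8 * rng.normal(size=(n_desc, n_cls, dim)))
labels = rng.integers(0, n_cls, size=n_img)
image_feats = l2_normalize(prototypes[labels] + 0.8 * rng.normal(size=(n_img, dim)))

selected, acc = greedy_descriptor_soup(text_emb, image_feats, labels, m=4)
print("selected descriptors:", selected, "training accuracy:", acc)
```

Because selection only needs forward passes and accuracy comparisons, a procedure of this shape requires no backpropagation, which is consistent with the memory argument made in the abstract.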

Paper Structure (31 sections, 6 equations, 7 figures, 12 tables, 2 algorithms)

Figures (7)

  • Figure 1: Illustration of word and descriptor soups. We conceptually position our two soup methods along the tradeoff between parameter efficiency and flexibility; we then list the pros and cons of our soups compared to prior work. Firstly, word soup is more parameter efficient than soft prompt tuning, because it uses discrete tokens (see Fig. \ref{fig:parameter_efficiency}). Secondly, word soup does not require an LLM or handcrafted prompts. Lastly, word soup attains higher target accuracy than prior descriptor methods by allowing a descriptor to be any permutation of words and explicitly maximizing its accuracy on training data (see Fig. \ref{fig:descriptor_and_word_accs}). However, word soup achieves this flexibility by sacrificing the explainability of descriptors. On the other hand, descriptor soup is interpretable (see Table \ref{tab:descriptor_examples}), but less flexible than word soup, since it is limited to selecting from the pool of GPT descriptors.
  • Figure 2: Comparison with PEFT and ZS methods. We vary $m$ for word soup as in Fig. \ref{fig:m_ablations_pretrained}. We vary the number of prompt tokens for CoOp, VPT and MaPLe, the number of prompts for ProDA, the rank for LoRA and adapters, and the number of layers tuned for SSF and BitFit. CoOp stores 512 parameters per soft token, while word soup stores 1 parameter per discrete token. Results are the average of 3 runs. Word soup achieves the maximal CoOp accuracy with only 1/25 of the parameters on the XD benchmark and 1/70 of the parameters on the DG benchmark. For detailed results, see Tab. \ref{tab:parameter_efficiency_detailed} in the Appendix.
  • Figure 3: (Left) Plot of ImageNet accuracy when the same descriptor is appended to every class label. Observe that there are more than 1,000 GPT descriptors and single-word descriptors that are better than standard ZS (in red). When we further consider word chains of length 4, the number of accurate descriptors increases dramatically (orange). (Right) Scatter plot of average target accuracy vs. ImageNet accuracy of GPT descriptors. We observe a positive correlation, so descriptors trained on ImageNet are likely to generalize to other datasets.
  • Figure 4: Varying $\lambda$ in the diversity loss. $\lambda=0$ corresponds to the standard CE loss. The left plot displays the average KL divergence between predicted class probabilities of word soup descriptors over the course of training. The right plot displays the cross-dataset accuracy for the same training runs. We observe that a larger $\lambda$ leads to higher diversity among descriptors; this results in a higher test accuracy.
  • Figure 5: Comparison of our soups with ZS baselines for varying $m$ on XD and DG evaluations. This experiment uses the same settings as Tab. \ref{tab:pretrained}. Our word soup achieves the best accuracies for all $m$. This shows that word soup is more descriptor efficient than baseline ZS methods.
  • ...and 2 more figures
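
The diversity measure plotted in Figure 4 (average KL divergence between the class probabilities predicted by different descriptors) can be sketched as follows. The caption does not specify the exact averaging scheme, so this sketch assumes a mean over all ordered descriptor pairs and over images; the function names and the `(m, n_images, n_classes)` logit layout are hypothetical.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def mean_pairwise_kl(logits):
    """Average KL(p_i || p_j) over ordered descriptor pairs i != j and images.

    logits: (m, n_images, n_classes), one set of class logits per descriptor.
    The pairwise averaging is an assumption, not taken from the paper.
    """
    m = logits.shape[0]
    if m < 2:
        raise ValueError("need at least two descriptors to measure diversity")
    logp = log_softmax(logits)
    p = np.exp(logp)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                # KL per image, then averaged over images
                total += float((p[i] * (logp[i] - logp[j])).sum(axis=-1).mean())
    return total / (m * (m - 1))

# Identical descriptors give zero diversity; independent ones give a positive value.
rng = np.random.default_rng(0)
same = np.tile(rng.normal(size=(1, 8, 5)), (3, 1, 1))
different = rng.normal(size=(3, 8, 5))
print(mean_pairwise_kl(same), mean_pairwise_kl(different))
```

Under this reading, a value near zero means the descriptors make nearly identical predictions, which is the loss of diversity that a larger $\lambda$ is reported to counteract.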