Enhancing Vision-Language Few-Shot Adaptation with Negative Learning

Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie

TL;DR

This work proposes SimNL, a Simple yet effective Negative Learning approach that exploits the task-specific knowledge in few-shot labeled samples more efficiently. SimNL discovers a complementary set of negative features that define “what is not a {CLASS}”.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have exhibited impressive zero-shot performance and transferability, allowing them to adapt to downstream tasks in a data-efficient manner. However, when only a few labeled samples are available, adapting VLMs to distinguish subtle differences between similar classes in specific downstream tasks remains challenging. In this work, we propose a Simple yet effective Negative Learning approach, SimNL, to more efficiently exploit the task-specific knowledge from few-shot labeled samples. Unlike previous methods that focus on identifying a set of representative positive features defining "what is a {CLASS}", SimNL discovers a complementary set of negative features that define "what is not a {CLASS}", providing additional insights that supplement the positive features to enhance task-specific recognition capability. Further, we identify that current adaptation approaches are particularly vulnerable to potential noise in the few-shot sample set. To mitigate this issue, we introduce a plug-and-play few-shot instance reweighting technique to suppress noisy outliers and amplify clean samples for more stable adaptation. Our extensive experimental results across 15 datasets validate that the proposed SimNL outperforms existing state-of-the-art methods on both few-shot learning and domain generalization tasks while achieving competitive computational efficiency. Code is available at https://github.com/zhangce01/SimNL.

Paper Structure (26 sections, 18 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Negative learning provides complementary information for more accurate recognition. (Left) Grad-CAM (Selvaraju et al., 2017) visualization of the similarity heatmaps with the learned positive and negative features of the ground-truth class; (Middle) Similarities (scaled by 100) to the learned positive and negative features of five similar candidate classes. While the positive branch alone may fail to distinguish among some closely related classes, incorporating the negative classifier, which eliminates certain incorrect classes, enhances the model's ability to accurately identify the true class; (Right) Performance comparisons with other state-of-the-art methods in 16-shot scenarios.
  • Figure 2: An overview of our proposed SimNL. We construct and learn the positive and negative CLIP-based classifiers across visual and textual modalities. Given an image to be classified, the classification logit for a specific class increases when the image feature $f_v$ closely aligns with the corresponding positive features $f_t^+, f_v^+$ and diverges from negative features $f_t^-, f_v^-$.
  • Figure 3: Visualization of cosine similarities on the ImageNet (Deng et al., 2009) validation set. We present distributions of pairwise similarities between the input image feature and both the learned positive and negative features from textual (Left) and visual (Right) modalities.
  • Figure 4: Few-shot instance reweighting. (Left) The performance of Tip-Adapter-F (Zhang et al., 2022) degrades drastically when label noise exists in the few-shot sample set; (Right) t-SNE (van der Maaten and Hinton, 2008) visualization of visual features for 4 random classes from the OxfordPets (Parkhi et al., 2012) dataset, where some outliers are marked with red circles.
  • Figure 5: Performance comparisons on few-shot learning on 11 image classification datasets. For each dataset, we report the mean accuracy and 95% confidence interval over 3 random seeds of our SimNL on 1-/2-/4-/8-/16-shot settings.
  • ...and 1 more figure
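
The behavior described in the Figure 1 and Figure 2 captions, where a class logit rises with similarity to positive features and falls with similarity to negative features, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single positive and a single negative prototype vector per class, plain cosine similarity, and a simple subtraction. The function names and the `neg_weight` parameter are hypothetical, introduced only for illustration. The actual SimNL classifiers are learned and span both textual and visual modalities.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_logits(f_v, positives, negatives, neg_weight=1.0):
    """One logit per class: similarity to the class's positive prototype
    minus a weighted similarity to its negative prototype.

    Assumption: one prototype of each kind per class (hypothetical setup).
    """
    return [
        cosine(f_v, pos) - neg_weight * cosine(f_v, neg)
        for pos, neg in zip(positives, negatives)
    ]


def predict(f_v, positives, negatives, neg_weight=1.0):
    """Index of the class with the largest combined logit."""
    logits = class_logits(f_v, positives, negatives, neg_weight)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy image feature and two closely related classes.
f_v = [1.0, 0.0]
positives = [[0.9, 0.2], [1.0, 0.1]]   # both positive prototypes lie near f_v
negatives = [[0.0, 1.0], [1.0, 0.0]]   # class 1's negative prototype matches f_v

# Positive branch only: class 1 looks slightly better.
print(predict(f_v, positives, negatives, neg_weight=0.0))  # -> 1

# Adding the negative branch rules out class 1.
print(predict(f_v, positives, negatives, neg_weight=1.0))  # -> 0
```

In this toy case the two positive prototypes are nearly indistinguishable from the image feature, so the positive branch alone picks class 1. The negative prototype of class 1 coincides with the image feature, which lowers that class's logit and shifts the prediction to class 0.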