Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning

Hao Tang, Shengfeng He, Jing Qin

TL;DR

SynTrans tackles few-shot learning by transferring diverse knowledge from large multimodal models to a lightweight learner without fine-tuning the vision backbone. It distills visual knowledge from a CLIP-based teacher, generates rich semantic descriptors through SynMine, and bridges the visual and semantic spaces with VSBird to produce robust, category-specific classifier weights. A meta-learning-based visual-semantic fusion step (with a learnable generator and reconstructor) yields the final multimodal classifiers, which outperform state-of-the-art methods on four benchmarks, especially in 5-way 1-shot tasks. The approach suggests that high-quality external semantic knowledge, when properly encoded and fused with visual cues, can significantly mitigate data scarcity in FSL.

Abstract

Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data's inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
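As a rough illustration of the final step described above, the sketch below blends a visual classifier weight and a semantic classifier weight per class and then classifies a query by cosine similarity. The convex combination controlled by a single coefficient `alpha` is an assumption made for illustration (the paper's outline only mentions a weight coefficient $\alpha$); the function names `l2_normalize`, `fuse_weights`, and `classify` are hypothetical and are not taken from the paper, and the learned weight generator and reconstructor are not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_weights(visual_w, semantic_w, alpha):
    """Blend per-class visual and semantic weights (assumed convex combination).

    visual_w, semantic_w: lists of per-class weight vectors of equal dimension.
    alpha: share given to the visual weights, in [0, 1] (hypothetical parameter).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fused = []
    for v, s in zip(visual_w, semantic_w):
        v, s = l2_normalize(v), l2_normalize(s)
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(v, s)])
    return fused


def classify(query, class_weights):
    """Return the index of the class whose weight has the highest cosine similarity."""
    q = l2_normalize(query)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(w))) for w in class_weights
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy 2-way 1-shot episode: the two visual prototypes are nearly identical,
    # while the semantic weights separate the classes clearly.
    visual = [[1.0, 0.0], [0.9, 0.1]]
    semantic = [[1.0, 0.0], [0.0, 1.0]]
    query = [0.8, 0.6]
    for alpha in (1.0, 0.5, 0.0):
        fused = fuse_weights(visual, semantic, alpha)
        print(alpha, classify(query, fused))
```

Setting `alpha` to 1.0 falls back to a purely visual classifier and 0.0 to a purely semantic one, which makes it easy to compare the two extremes on a toy episode before sweeping intermediate values.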

Paper Structure

This paper contains 27 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The pipeline of the proposed Synergistic Knowledge Transfer (SynTrans) framework.
  • Figure 2: The pipeline of how the proposed SynMine module generates high-quality semantic descriptors.
  • Figure 3: The pipeline of the proposed Visual-Semantic Bridging (VSBird) module.
  • Figure 4: Influence of weight coefficient $\alpha$ on MiniImageNet.
  • Figure 5: t-SNE visualization of the classification weights for all novel categories in MiniImageNet. (a) 1-shot visual-based classifier. (b) 1-shot multimodal classifier. (c) 5-shot visual-based classifier. (d) 5-shot multimodal classifier.