MPA: Multimodal Prototype Augmentation for Few-Shot Learning

Liwen Wu, Wei Wang, Lei Zhao, Zhan Gao, Qika Lin, Shaowen Yao, Zuozhu Liu, Bin Pu

TL;DR

MPA is a Multimodal Prototype Augmentation framework for few-shot learning (FSL) that combines LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). It outperforms existing state-of-the-art methods across most settings.

Abstract

Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples, and it has been widely applied in fields such as natural science, remote sensing, and medical imaging. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, so the resulting prototypes lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, which comprises LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations (e.g., changes in viewing distance, camera angle, and lighting conditions) to enhance feature diversity. AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, in the 5-way 1-shot setting, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain settings, respectively.
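
To make the idea of a multimodal prototype concrete, the following is a minimal sketch, not the authors' implementation. It assumes that visual features and text features of paraphrased class descriptions already live in a shared embedding space, and it fuses them by a plain average. The fusion rule, the helper names (`build_prototype`, `classify`), and the toy feature vectors are all illustrative assumptions; the abstract does not specify them.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototype(visual_feats, text_feats):
    # Assumption: both modalities share one embedding space and are pooled
    # by a simple average. The paper's actual fusion may differ.
    return mean_vector(visual_feats + text_feats)


def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))


# Toy 2-way 1-shot episode: one support image feature per class plus two
# text features standing in for LLM-paraphrased class descriptions.
support = {
    "cat": {"visual": [[0.9, 0.1, 0.0]], "text": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]},
    "dog": {"visual": [[0.1, 0.9, 0.0]], "text": [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]},
}
prototypes = {
    name: build_prototype(feats["visual"], feats["text"])
    for name, feats in support.items()
}
prediction = classify([0.7, 0.3, 0.1], prototypes)
print(prediction)
```

With only one support image per class, the text features contribute two of the three vectors behind each prototype, which illustrates how semantic cues can stabilize a prototype in the 1-shot case.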

Paper Structure

This paper contains 29 sections, 11 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: Unlike existing methods, which typically build prototypes from original image features, MPA builds prototypes from multimodal features (i.e., LLM-based multi-variant semantic features and hierarchical multi-view features), significantly improving the model’s generalization performance and robustness across diverse tasks.
  • Figure 2: Overview of MPA. Our framework includes three components: LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and Adaptive Uncertain Class Absorber (AUCA). LMSE produces high-quality semantic features by leveraging an LLM, reducing reliance on visual inputs for prototype construction. HMA improves data diversity and feature robustness through multi-view and feature-level augmentations. AUCA mitigates sample bias by interpolating between prototypes and sampling from a normal distribution, where the interpolation weight $\lambda$ is adaptively determined based on prototype differences. Finally, logistic regression is applied to the optimized features for classification.
  • Figure 3: Feature visualization of MPA on public datasets.
  • Figure 4: UMAP visualization.
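
The AUCA step described in the Figure 2 caption (interpolating between prototypes, then sampling from a normal distribution) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the caption says only that $\lambda$ depends on prototype differences, so the `adaptive_lambda` rule below (a sigmoid of the Euclidean distance) is a hypothetical placeholder, as are the function names, `sigma`, and the sample count.

```python
import math
import random


def adaptive_lambda(proto_a, proto_b):
    # Hypothetical rule: squash the Euclidean distance between the two
    # prototypes through a sigmoid, giving a weight in [0.5, 1).
    # The paper's actual formula for lambda is not given in this section.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(proto_a, proto_b)))
    return 1.0 / (1.0 + math.exp(-dist))


def uncertain_class_samples(proto_a, proto_b, n_samples=5, sigma=0.1, seed=0):
    """Create features for an 'uncertain class' between two prototypes."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    lam = adaptive_lambda(proto_a, proto_b)
    # Interpolate between the two class prototypes.
    center = [lam * a + (1.0 - lam) * b for a, b in zip(proto_a, proto_b)]
    # Draw Gaussian samples around the interpolated point (assumed
    # isotropic noise with standard deviation sigma).
    samples = [[rng.gauss(c, sigma) for c in center] for _ in range(n_samples)]
    return lam, center, samples


lam, center, samples = uncertain_class_samples([1.0, 0.0], [0.0, 1.0])
print(round(lam, 3), len(samples))
```

In a full pipeline these sampled features would be labeled as an extra class and passed, together with the real class features, to the final classifier (logistic regression in the caption), so that ambiguous queries have somewhere to land other than a real class.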