MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li, Ruixuan Li

TL;DR

The paper addresses the limitations of unsupervised pretraining for few-shot learning by quantitatively analyzing contrastive learning and masked image modeling, then proposing Masked Image Contrastive Modeling (MICM) to blend their strengths. MICM uses an encoder-decoder architecture with a delayed class token, teacher-student self-distillation, and a hybrid loss that combines image reconstruction with contrastive objectives, within a two-stage U-FSL framework that includes OpTA and pseudo-label enhancements. Across MiniImageNet, TieredImageNet, CIFAR-FS, and cross-domain datasets, MICM achieves state-of-the-art results by improving both generalization to novel classes and discriminability of representations, while remaining adaptable to various FSL strategies. The work advances unsupervised pretraining for U-FSL by delivering a versatile, robust method that reduces reliance on labeled data and enhances transferability across diverse tasks.
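
To make the "hybrid loss" idea concrete, here is a minimal sketch of how a masked-reconstruction term and a contrastive term could be added together. It is an illustration under stated assumptions, not the paper's implementation: InfoNCE stands in for the contrastive objective (the paper's own objective involves teacher-student self-distillation, which is not reproduced here), and every name, the `contrast_weight` parameter, and the `temperature` default are hypothetical.

```python
import math

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of floats.
    mask: list of 0/1 flags, 1 meaning the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p_patch, t_patch, hidden in zip(pred, target, mask):
        if hidden:
            total += sum((p - t) ** 2 for p, t in zip(p_patch, t_patch)) / len(p_patch)
            count += 1
    return total / count if count else 0.0

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(anchor, positive, negatives, temperature=0.2):
    """InfoNCE: cross-entropy of picking the positive among all candidates.

    Assumption: used here as a generic stand-in for the contrastive term.
    """
    a = _normalize(anchor)
    logits = [_dot(a, _normalize(positive)) / temperature]
    logits += [_dot(a, _normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def hybrid_loss(pred, target, mask, anchor, positive, negatives, contrast_weight=1.0):
    """Hypothetical combination: reconstruction + contrast_weight * contrastive."""
    recon = masked_reconstruction_loss(pred, target, mask)
    contrast = info_nce_loss(anchor, positive, negatives)
    return recon + contrast_weight * contrast
```

The reconstruction term only scores patches the encoder never saw, which is what pushes the model toward general visual features. The contrastive term pulls two views of the same image together, which is what supplies discriminability.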

Abstract

Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines.

Paper Structure (26 sections, 7 equations, 10 figures, 12 tables)

This paper contains 26 sections, 7 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Attention maps of SimCLR [chen2020simple] (CL), MAE [he2022masked] (MIM), and MICM.
  • Figure 2: (a) Contrastive Learning is dedicated to learning discriminative data representations by contrasting and differentiating between similar (positive) and dissimilar (negative) pairs of data samples. This approach emphasizes the relative comparison to achieve distinctiveness in the learned features. (b) Masked Image Modeling involves training a model to accurately predict the original content of intentionally obscured (masked) portions of images. This technique focuses on learning comprehensive and robust feature representations by encouraging the model to infer missing information. (c) Masked Image Contrastive Modeling implements a decoder that simultaneously enhances features and reconstructs the original content of images. This method synergistically merges the principles of contrastive representation learning with effective pretext task design, thereby integrating the strengths of both approaches to achieve more nuanced and effective learning.
  • Figure 3: Histograms of the similarity between features the model extracts for novel and base classes. Our model (blue) extracts distinctive features for novel classes. By contrast, SimCLR, a contrastive learning method, keeps relying on features that discriminate base classes when it encodes novel classes, so its novel-class features resemble base-class features (brown). SimCLR needs fine-tuning with an adequate number of labeled samples to correct this and improve classification accuracy.
  • Figure 4: t-SNE visualization of features from various unsupervised pre-trained models on novel categories. First row: before fine-tuning. Second row: after 5-shot fine-tuning. Unlike SimCLR and MICM, MAE lacks discriminative features both before and after few-shot learning. The FSL performance of each model is given in parentheses.
  • Figure 5: (a) The Unsupervised Pretraining phase performs self-supervised learning on a large, unlabeled base dataset, which is crucial for developing initial representations. (b) The Few-Shot Learning phase can adapt to a variety of few-shot learning approaches, both inductive and transductive.
  • ...and 5 more figures
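
The similarity analysis described in the Figure 3 caption can be approximated with a short sketch. The measurement protocol is not specified in this summary, so the code assumes one plausible reading: for each novel-class feature, take the highest cosine similarity to any base-class prototype (a mean feature per base class), then bin those values. The function names and the prototype-based comparison are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def max_similarity_to_base(novel_features, base_prototypes):
    """For each novel feature, the similarity to its closest base-class prototype.

    Assumption: prototypes are per-class mean features computed elsewhere.
    """
    return [
        max(cosine_similarity(f, p) for p in base_prototypes)
        for f in novel_features
    ]

def histogram(values, bins=10, low=-1.0, high=1.0):
    """Count values into equal-width bins over [low, high]; high lands in the last bin."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = int((v - low) / width)
        idx = max(0, min(idx, bins - 1))  # clamp values on or beyond the edges
        counts[idx] += 1
    return counts
```

Under this reading, a histogram concentrated near 1.0 would indicate that novel-class features largely reuse base-class directions, while mass spread toward lower similarities would indicate more distinctive novel-class features.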