UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning

Long Zhou, Fereshteh Shakeri, Aymen Sadraoui, Mounir Kaaniche, Jean-Christophe Pesquet, Ismail Ben Ayed

TL;DR

This work tackles the sensitivity of transductive few-shot learning to class-balance and entropy hyper-parameters. It introduces UNEM, an unrolled Generalized EM framework that learns layer-wise hyper-parameters $\lambda^{(\ell)}$ and $T^{(\ell)}$ by mapping GEM iterations to neural network layers, and supports both Gaussian (vision-only) and Dirichlet (vision-language) data models. By unrolling the optimization, UNEM achieves substantial improvements over prior iterative methods, with gains up to $10\%$ on vision-only and $7.5\%$ on vision-language benchmarks, and reduces reliance on manual hyper-parameter tuning. The method demonstrates strong performance across diverse datasets and model families, including CLIP-based tasks, highlighting its practical impact for robust transductive few-shot learning in vision and language-conditioned settings.

Abstract

Transductive few-shot learning has recently triggered wide attention in computer vision. Yet, current methods introduce key hyper-parameters, which control the prediction statistics of the test batches, such as the level of class balance, affecting performances significantly. Such hyper-parameters are empirically grid-searched over validation data, and their configurations may vary substantially with the target dataset and pre-training model, making such empirical searches both sub-optimal and computationally intractable. In this work, we advocate and introduce the unrolling paradigm, also referred to as "learning to optimize", in the context of few-shot learning, thereby learning efficiently and effectively a set of optimized hyper-parameters. Specifically, we unroll a generalization of the ubiquitous Expectation-Maximization (EM) optimizer into a neural network architecture, mapping each of its iterates to a layer and learning a set of key hyper-parameters over validation data. Our unrolling approach covers various statistical feature distributions and pre-training paradigms, including recent foundational vision-language models and standard vision-only classifiers. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks, showing significant gains brought by the proposed unrolled EM algorithm over iterative variants. The achieved improvements reach up to 10% and 7.5% on vision-only and vision-language benchmarks, respectively.

Paper Structure

This paper contains 22 sections, 17 equations, 5 figures, 12 tables, 2 algorithms.

Figures (5)

  • Figure 1: Impact of the class-balance hyper-parameter $\lambda$ on the accuracy of transductive few-shot classification. The accuracy results are obtained using the EM-Dirichlet algorithm [EMDIRICHLET_CVPR2024] applied to vision-language models in the 4-shot setting. The plot shows that the choice of $\lambda$ has a strong impact on performance, and that the optimal $\lambda$ (indicated with the star symbol) may vary by orders of magnitude depending on the target downstream dataset (here, ten very different fine-grained classification datasets). Further comments on the optimal $\lambda$ values and the values chosen in [EMDIRICHLET_CVPR2024] are provided in Section \ref{sec:experiments}. The values of the hyper-parameters learned by the proposed unrolled algorithm are illustrated and analyzed in Appendix \ref{app:add_res}.
  • Figure 2: An overview of the unrolled GEM algorithm for a given iteration. Each iteration $\ell$ corresponds to a network layer $\mathcal{L}^{(\ell)}$. Each layer depends on the vector of hyper-parameters $(\lambda^{(\ell)}, T^{(\ell)})$.
  • Figure 3: Overall architecture of the designed UNEM.
  • Figure 4: Illustration of the learned hyper-parameters $\lambda^{(\ell)}$ and $T^{(\ell)}$ across layers for CUB (with a ResNet18 model), mini-ImageNet (with a ResNet18 model) and mini-ImageNet (with a WRN28-10 model).
  • Figure 5: Illustration of the learned hyper-parameters $\lambda^{(\ell)}$ and $T^{(\ell)}$ across layers for some datasets with vision-language models.
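
To make the layer view in Figure 2 concrete, the following is a minimal NumPy sketch of a forward pass in which each EM-style iteration $\ell$ uses its own pair $(\lambda^{(\ell)}, T^{(\ell)})$. It is an illustration under stated assumptions, not the paper's algorithm: the isotropic Gaussian score, the way $\lambda$ weights the log class proportions, the way $T$ scales the softmax, and the names `unrolled_forward`, `lams` and `temps` are all hypothetical choices made here. The hyper-parameters are fixed inputs in this sketch, whereas in an unrolled network they would be trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def unrolled_forward(feats, init_means, lams, temps, eps=1e-12):
    """Run one EM-style layer per (lam, temp) pair.

    feats:      (N, D) query features
    init_means: (K, D) initial class means (e.g. from a support set)
    lams:       per-layer class-balance weights (assumed form)
    temps:      per-layer softmax temperatures (assumed form)
    """
    means = np.array(init_means, dtype=float)
    n_classes = means.shape[0]
    props = np.full(n_classes, 1.0 / n_classes)
    assign = None
    for lam, temp in zip(lams, temps):
        # E-like step: soft assignments from Gaussian scores,
        # with the log class proportions weighted by lam.
        sq_dist = ((feats[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        scores = -0.5 * sq_dist + lam * np.log(props + eps)
        assign = softmax(scores / temp, axis=1)
        # M-like step: re-estimate class means and proportions.
        mass = assign.sum(axis=0) + eps
        means = (assign.T @ feats) / mass[:, None]
        props = mass / mass.sum()
    return assign, means, props


# Toy usage: two well-separated 2-D clusters, three layers.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1], 20)
centers = np.array([[-3.0, -3.0], [3.0, 3.0]])
toy_feats = centers[true_labels] + 0.5 * rng.standard_normal((40, 2))
toy_assign, toy_means, toy_props = unrolled_forward(
    toy_feats,
    init_means=np.array([[-1.0, -1.0], [1.0, 1.0]]),
    lams=[1.0, 0.5, 0.1],   # illustrative values, not learned
    temps=[2.0, 1.0, 0.5],  # illustrative values, not learned
)
```

Because every layer is a differentiable function of its own `lam` and `temp`, the same loop written in an automatic-differentiation framework would allow these values to be fitted by gradient descent on validation tasks, which is the idea the unrolling paradigm relies on.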