On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization

Jordi Armengol-Estapé, Vincent Michalski, Ramnath Kumar, Pierre-Luc St-Charles, Doina Precup, Samira Ebrahimi Kahou

TL;DR

This work investigates whether auxiliary language-based signals can improve few-shot visual classification by conditioning the main feature extractor through conditional batch normalization (BN). The authors propose SimpAux, a three-component architecture with a classifier, an auxiliary network that predicts language-based information from the input, and a bridge network that converts auxiliary embeddings into BN parameters to modulate the main network in a single forward pass. Experiments on CUB-200-2011 and mini-ImageNet show inconsistent gains: about 1.5 percentage points on CUB but no improvement or slight degradation on mini-ImageNet, with ablations attributing the gains to the bridge's added capacity rather than to language information. The work provides practical guidance on evaluating multi-modal meta-learning and suggests architectural design choices and future directions, including leveraging language-pretrained encoders for robust few-shot learning.

Abstract

Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists of three components: a classifier, an auxiliary network, and a bridge network. While the classifier performs the main classification task, the auxiliary network learns to predict language representations from the same input, and the bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier using conditional batch normalization. The bridge should encourage a form of lightweight semantic alignment between language and vision, which could be useful for the classifier. However, after evaluating the proposed approach on two popular few-shot classification benchmarks, we find that a) the improvements do not reproduce across benchmarks, and b) when they do, the improvements are due to the additional compute and parameters introduced by the bridge network. We contribute insights and recommendations for future work in multi-modal meta-learning, especially when using language representations.

Paper Structure

This paper contains 17 sections, 3 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Architectural overview of the method we experimented with. It consists of three components: a classifier, an auxiliary network, and a bridge network. The few-shot classifier and auxiliary network receive the same input example. The bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier through conditional batch normalization.
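
The data flow in the Figure 1 caption can be illustrated with a minimal sketch of conditional batch normalization. This is an illustration under stated assumptions, not the authors' implementation: the bridge is reduced to a single linear layer, it predicts per-example offsets added to learned scale and shift parameters (a common formulation of conditional batch normalization, which the text above does not confirm), and features are flat vectors rather than convolutional maps. All names (`conditional_batch_norm`, `bridge`, `w_gamma`, `w_beta`) are hypothetical.

```python
import math


def batch_norm_stats(batch):
    """Per-channel mean and biased variance over a batch of feature vectors."""
    n = len(batch)
    channels = len(batch[0])
    means = [sum(x[c] for x in batch) / n for c in range(channels)]
    variances = [
        sum((x[c] - means[c]) ** 2 for x in batch) / n for c in range(channels)
    ]
    return means, variances


def linear(weights, bias, vec):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def conditional_batch_norm(batch, aux_embeddings, bridge, gamma, beta, eps=1e-5):
    """Normalize classifier features, then scale and shift them per example.

    Assumption: the bridge maps each auxiliary embedding to offsets
    (d_gamma, d_beta) that are added to the learned gamma and beta.
    """
    means, variances = batch_norm_stats(batch)
    out = []
    for x, e in zip(batch, aux_embeddings):
        d_gamma = linear(bridge["w_gamma"], bridge["b_gamma"], e)
        d_beta = linear(bridge["w_beta"], bridge["b_beta"], e)
        y = []
        for c in range(len(x)):
            x_hat = (x[c] - means[c]) / math.sqrt(variances[c] + eps)
            y.append((gamma[c] + d_gamma[c]) * x_hat + (beta[c] + d_beta[c]))
        out.append(y)
    return out


# Toy example: two examples, two channels, one-dimensional auxiliary embedding.
batch = [[1.0, 2.0], [3.0, 6.0]]      # classifier features
aux = [[1.0], [0.0]]                  # high-level auxiliary features (made up)
gamma, beta = [1.0, 1.0], [0.0, 0.0]  # learned BN scale and shift
bridge = {
    "w_gamma": [[0.5], [0.0]], "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0], [1.0]], "b_beta": [0.0, 0.0],
}
# A bridge with all-zero weights reduces to ordinary batch normalization.
zero_bridge = {
    "w_gamma": [[0.0], [0.0]], "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0], [0.0]], "b_beta": [0.0, 0.0],
}

modulated = conditional_batch_norm(batch, aux, bridge, gamma, beta)
plain = conditional_batch_norm(batch, aux, zero_bridge, gamma, beta)
```

In this toy setup only the first example has a non-zero auxiliary embedding, so only its normalized features are rescaled and shifted; the second example comes out identical to plain batch normalization. The sketch also makes the abstract's caveat concrete: the bridge adds parameters of its own, so any gain needs to be compared against a baseline with matched capacity.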