Lightweight Cross-Modal Representation Learning

Bilal Faye; Hanane Azzag; Mustapha Lebbah; Djamel Bouchaffra

TL;DR

LightCRL tackles the resource intensity of cross-modal representation learning by freezing large pretrained encoders and training only a single Deep Fusion Encoder (DFE), which maps every modality into a shared latent space. A modality-specific context vector tells the DFE which modality it is processing, so one network can handle all modalities, and also acts as a prior. A contrastive objective made of two directional losses aligns the modalities in the latent space. On zero-shot and linear classification tasks (CIFAR-10, CIFAR-100, and Tiny ImageNet), the approach is competitive with or superior to heavier baselines while drastically reducing the number of trainable parameters. This makes cross-modal learning more accessible when data and compute are limited, and the learned representations transfer well across tasks.
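
The summary above mentions a contrastive objective with two directional losses but does not give its form. The following is a minimal sketch, assuming a CLIP-style InfoNCE loss over a batch of paired embeddings. The function names, the temperature value, and the equal weighting of the two directions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def directional_loss(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def bidirectional_contrastive_loss(z1, z2, temperature=0.07):
    """z1[i] and z2[i] are latent embeddings of the same item in two modalities.

    Assumption: the two directional losses are averaged with equal weight.
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature
    loss_1_to_2 = directional_loss(logits)    # modality 1 retrieves modality 2
    loss_2_to_1 = directional_loss(logits.T)  # modality 2 retrieves modality 1
    return 0.5 * (loss_1_to_2 + loss_2_to_1)
```

Under these assumptions, a batch whose pairs are correctly matched yields a lower loss than the same batch with its pairs shuffled, which is the signal that pulls matching embeddings together in the shared space.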

Abstract

Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, which require extensive datasets and incur high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network, the Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. This reduces the overall parameter count while still delivering robust performance comparable to that of more complex systems.

Paper Structure (4 sections, 3 equations, 1 figure, 3 tables)

This paper contains 4 sections, 3 equations, 1 figure, 3 tables.

Introduction
Proposed Method
Experiments
Conclusion

Figures (1)

Figure 1: LightCRL framework. Only the DFE, denoted $f$, is trained for cost-effectiveness, while the encoders $f_1$ and $f_2$ remain frozen. $\mathbf{m}^1_i$ and $\mathbf{m}^2_i$ are inputs from modalities $1$ and $2$, and $\mathbf{\bar{m}}^1_i$ and $\mathbf{\bar{m}}^2_i$ are their embeddings produced by the respective frozen encoders. $\mathbf{\hat{m}}^1_i$ and $\mathbf{\hat{m}}^2_i$ are the embeddings in the common latent space. $c_l$ is the context identifier for modality $l$, and $g_1$, $g_2$, and $g_3$ ensure uniform dimensions.
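
The data flow in the caption can be sketched as a forward pass. This is an illustrative sketch only: the frozen encoders are replaced by fixed random linear maps, the DFE is a small two-layer MLP, the context identifier is a one-hot vector, and the context is fused by addition. All of these choices, along with every dimension and variable name, are assumptions, since the caption does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT = 32  # hypothetical width of the shared latent space

def linear(d_in, d_out):
    """Random weight matrix standing in for a learned or pretrained layer."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))

# Stand-ins for the frozen pretrained encoders f_1 and f_2 (never updated).
F1 = linear(64, 48)   # modality 1: 64-d input -> 48-d embedding
F2 = linear(100, 24)  # modality 2: 100-d input -> 24-d embedding

# g_1, g_2, g_3 bring both embeddings and the context to a uniform dimension.
# Whether these projections are trained is not stated in the caption.
G1 = linear(48, D_LATENT)
G2 = linear(24, D_LATENT)
G3 = linear(2, D_LATENT)  # projects the one-hot context identifier c_l

# The DFE f: one network shared by both modalities (the trained component).
W_a = linear(D_LATENT, D_LATENT)
W_b = linear(D_LATENT, D_LATENT)

def dfe(x):
    return np.maximum(x @ W_a, 0.0) @ W_b  # ReLU MLP, an assumed architecture

def embed(m, frozen, g, modality_index):
    m_bar = m @ frozen                # frozen-encoder embedding
    c = np.eye(2)[modality_index]     # one-hot context identifier c_l
    fused = m_bar @ g + c @ G3        # assumption: additive fusion with context
    return dfe(fused)                 # embedding in the common latent space

m1 = rng.normal(size=(4, 64))   # batch of 4 items from modality 1
m2 = rng.normal(size=(4, 100))  # the paired items from modality 2
m1_hat = embed(m1, F1, G1, 0)
m2_hat = embed(m2, F2, G2, 1)
```

Both outputs have the same shape, so they can be compared directly by a contrastive objective, and only the shared network's weights would need gradient updates in this setup.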
