OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Bilal Faye, Hanane Azzag, Mustapha Lebbah

TL;DR

OneEncoder tackles the high cost of cross-modal alignment by freezing large modality-specific encoders and training only a lightweight Universal Projection (UP). It progressively aligns modalities—starting with image-text and extending to audio and video—via a compact Alignment Layer (AL) and modality tokens, enabling transitive alignment in a shared embedding space. Across image-text, text-audio, and text-video tasks, OneEncoder often matches or surpasses heavy baselines like CLIP, AudioCLIP, and X-CLIP while using orders of magnitude fewer trainable parameters. The approach also extends to visual question answering, demonstrating reduced training cost and strong performance, and suggests broad practical impact for deploying multimodal systems with limited paired data.

Abstract

Cross-modal alignment learning integrates information from different modalities, such as text, image, audio, and video, to create unified models. It develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders that must be fine-tuned or trained from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This reliance has three limitations: (i) training large encoders on extensive datasets is very expensive, (ii) large aligned paired datasets are difficult to acquire, and (iii) adding a new modality requires retraining the entire framework. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). First, we train a lightweight Universal Projection module (UP) to align the image and text modalities. Then, we freeze the pretrained UP and progressively align each new modality to those already aligned. Because of its lightweight design, OneEncoder operates efficiently and cost-effectively even when vast aligned datasets are unavailable. Trained on small paired datasets, it shows strong performance in tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.
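The two-step schedule described in the abstract can be sketched as simple bookkeeping of which modules receive gradient updates. This is a minimal illustration, not the authors' code: the `Module` class, the `configure_step` helper, the module names, and every parameter count below are hypothetical placeholders chosen only to show the freeze/unfreeze pattern.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a network block: tracks size and trainability only."""
    name: str
    num_params: int
    trainable: bool = False


def configure_step(modules, train_names):
    """Freeze every module, then unfreeze only those named in train_names."""
    for m in modules:
        m.trainable = m.name in train_names
    return [m.name for m in modules if m.trainable]


def trainable_params(modules):
    """Total number of parameters that would receive gradient updates."""
    return sum(m.num_params for m in modules if m.trainable)


# Toy sizes (assumption): they are NOT the paper's parameter counts.
modules = [
    Module("image_encoder", 1_000_000),
    Module("text_encoder", 1_000_000),
    Module("audio_encoder", 1_000_000),
    Module("UP", 10_000),       # Universal Projection
    Module("AL_audio", 100),    # Alignment Layer for the new modality
]

# Step 1: align image and text; only the UP is trained, encoders stay frozen.
step1 = configure_step(modules, {"UP"})
step1_trainable = trainable_params(modules)

# Step 2: add audio; the UP is now frozen and only the audio AL is trained.
step2 = configure_step(modules, {"AL_audio"})
step2_trainable = trainable_params(modules)

print(step1, step1_trainable)
print(step2, step2_trainable)
```

Under these assumptions, adding a further modality such as video would repeat step 2 with a new alignment layer, leaving every previously trained module untouched.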

Paper Structure (17 sections, 3 equations, 8 figures, 9 tables, 3 algorithms)

This paper contains 17 sections, 3 equations, 8 figures, 9 tables, 3 algorithms.

Figures (8)

  • Figure 1: Comparison of methods for aligning three modalities: standard cross-modal alignment vs. OneEncoder. The standard approach aligns modalities by training modality-specific encoders simultaneously. OneEncoder uses frozen, pretrained encoders with a lightweight Universal Projection (UP) module trained on two modalities. For each new modality, the UP stays frozen and only the Alignment Layer (AL) is trained. Modality tokens enable efficient switching between modalities. Video can be aligned with the other modalities (image, text, audio) in the same way.
  • Figure 2: OneEncoder architecture. OneEncoder includes frozen pretrained modality-specific encoders, a Universal Projection module (UP), and an Alignment Layer (AL). In step 1, the UP, which consists of a Transformer encoder, is trained to align text and image modalities. In step 2, the pretrained UP is frozen, and the AL, composed of a multi-layer perceptron, is trained to align audio with the text and image modalities. During this step, either image or text is selected to align with audio, indirectly aligning audio with the non-selected modality. The UP fuses input ($\mathbf{x}_{\text{m}}$) and modality tokens ($\mathbf{t}_{\text{m}}$) to switch between modalities during a forward pass.
  • Figure 3: After training, OneEncoder can be used for various downstream tasks: in zero-shot mode by freezing the UP (Universal Projection) and AL (Alignment Layer), or fine-tuned for other tasks.
  • Figure 4: Qualitative results across three modalities. For each query, OneEncoder retrieves the most relevant data from the available dataset, showcasing the effectiveness of the alignment.
  • Figure 5: OneEncoder architecture for the Visual Question Answering (VQA) task. The OneEncoder framework trains only the UP module to align the textual answer with both the image and the textual question, unlike the baseline method, which trains all modality-specific encoders (image encoder and text encoder) and is therefore more computationally expensive. Both approaches use a "Prediction Head" to generate textual answers.
  • ...and 3 more figures
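The forward pass described in the Figure 2 and Figure 4 captions can be sketched in plain Python. Several things here are assumptions made to keep the example short: the UP is a Transformer encoder in the paper but is replaced below by a single linear map; fusing $\mathbf{x}_{\text{m}}$ and $\mathbf{t}_{\text{m}}$ by concatenation is a guess, since the captions do not specify the mechanism; applying the AL before the UP is likewise assumed; and all names (`embed`, `retrieve`, `DIM`, the random weights) are hypothetical.

```python
import math
import random

DIM = 4  # toy feature width (assumption)
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


# Shared UP weights, acting on the concatenation [x_m ; t_m] (assumed fusion).
UP_W = rand_matrix(DIM, 2 * DIM)
# One token per modality, used to switch the shared UP between modalities.
TOKENS = {m: [rng.uniform(-1, 1) for _ in range(DIM)]
          for m in ("image", "text", "audio")}
# Alignment Layer for audio: a small two-layer MLP.
AL_W1, AL_W2 = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)


def alignment_layer(x):
    hidden = [max(0.0, h) for h in linear(AL_W1, x)]  # ReLU
    return linear(AL_W2, hidden)


def universal_projection(x, modality):
    fused = x + TOKENS[modality]  # list concatenation, not addition
    return normalize(linear(UP_W, fused))


def embed(features, modality):
    """Map frozen-encoder features into the shared space."""
    if modality == "audio":  # later modalities pass through their AL first
        features = alignment_layer(features)
    return universal_projection(features, modality)


def retrieve(query_emb, candidates):
    """Index of the candidate with the highest cosine similarity to the query."""
    scores = [sum(q * c for q, c in zip(query_emb, normalize(cand)))
              for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


audio_emb = embed([0.5, -0.2, 0.1, 0.9], "audio")
text_emb = embed([0.3, 0.3, -0.7, 0.2], "text")
print(retrieve(audio_emb, [text_emb, audio_emb]))
```

With untrained random weights the similarities carry no meaning; the sketch only shows how a single shared projection, a per-modality token, and a per-modality alignment layer could fit together for zero-shot retrieval.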