Data-Efficient Multimodal Fusion on a Single GPU

Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs

TL;DR

This work tackles data- and compute-efficient multimodal fusion by bootstrapping from frozen, pre-trained unimodal encoders and learning lightweight fusion adapters to align latent representations in a shared space $\mathcal{S}$. It introduces FuseMix, a latent-space mixup augmentation that operates on the unimodal latent spaces $\mathcal{Z}_X$ and $\mathcal{Z}_Y$, paired with a symmetric contrastive objective $\mathcal{L}_{\text{sym}}^{\text{FuseMix}}$ to train the fusion adapters. The approach enables competitive or superior image-text and audio-text retrieval with orders of magnitude less data and compute, and even supports audio-to-image generation by aligning Whisper into CLIP space for conditioning GLIDE. The method is modular and plug-and-play, allowing seamless integration of newer unimodal encoders and encouraging data-efficient experimentation through analysis of dataset quantity, quality, and diversity.
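The two ingredients named above, a latent-space mixup with a shared mixing coefficient and a symmetric contrastive objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the Beta(alpha, alpha) sampling of the coefficient, the temperature value, the InfoNCE form of the loss, and the linear maps standing in for the fusion adapters are all assumptions made for illustration.

```python
import numpy as np


def fusemix(z_x, z_y, alpha=1.0, rng=None):
    """Mix paired latents using ONE coefficient and ONE permutation for both modalities.

    z_x, z_y: arrays of shape (batch, dim_x) and (batch, dim_y); row i of each is a pair.
    alpha: Beta(alpha, alpha) parameter (an assumed choice of distribution).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)        # shared mixing coefficient
    perm = rng.permutation(len(z_x))    # shared choice of mixing partner
    mixed_x = lam * z_x + (1.0 - lam) * z_x[perm]
    mixed_y = lam * z_y + (1.0 - lam) * z_y[perm]
    return mixed_x, mixed_y


def _l2_normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def _log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(s_x, s_y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE losses; row i of s_x matches row i of s_y."""
    s_x, s_y = _l2_normalize(s_x), _l2_normalize(s_y)
    logits = s_x @ s_y.T / temperature
    idx = np.arange(len(s_x))
    loss_xy = -_log_softmax(logits)[idx, idx].mean()
    loss_yx = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_xy + loss_yx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_x = rng.normal(size=(8, 16))   # stand-in for pre-computed latents of modality X
    z_y = rng.normal(size=(8, 12))   # stand-in for pre-computed latents of modality Y
    w_x = rng.normal(size=(16, 4))   # hypothetical linear "adapter" for X
    w_y = rng.normal(size=(12, 4))   # hypothetical linear "adapter" for Y
    mixed_x, mixed_y = fusemix(z_x, z_y, rng=rng)
    print(symmetric_contrastive_loss(mixed_x @ w_x, mixed_y @ w_y))
```

Because the encoders are frozen, the latents can be computed once and cached, so a training step in this sketch touches only the small adapter weights. Sharing the coefficient and the permutation keeps each augmented pair semantically consistent across the two modalities.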

Abstract

The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower cost. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance, and in certain cases outperform state-of-the-art methods, in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Text-to-image retrieval performance as a function of the number of image-text pairs used during training, evaluated on the Flickr30K test set (Young et al., 2014). Note that the $x$-axis is in log scale.
  • Figure 2: A schematic of our proposed fusion framework to align the latent spaces of pre-trained unimodal encoders using a minimal set of paired data. The unimodal encoders are kept frozen, and their latent encodings are pre-computed only once. FuseMix applies mixup on each latent space, importantly sharing the mixing coefficient across modalities, and is used as a modality-agnostic data augmentation. Then, the lightweight fusion adapters are trained to align the resulting augmented latents into a shared latent space.
  • Figure 3: Measuring the effect of dataset quantity, quality, and diversity on downstream performance, evaluated using text-to-image retrieval on the Flickr30K test set. The $x$-axes indicate the relative/absolute number of image-text pairs, while H and W denote human and web-annotated, respectively. $\Delta$ R@1 (%) denotes relative improvement in Recall@1 compared to uniform subsampling.
  • Figure 4: Results of audio-to-image generation. The top row was generated from audio clips (accessible from the audio icons), and the bottom row was generated by describing the audio clips in text.
  • Figure 5: Measuring the effect of model size, batch size, and data augmentations on downstream performance, evaluated on the Flickr30K test set. GN denotes Gaussian noise with a standard deviation of 0.01 and RQ denotes random quantization. By default, R@1 denotes text-to-image Recall@1.
  • ...and 1 more figure
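
The captions above report text-to-image Recall@1 and a relative improvement $\Delta$ R@1 (%). As a rough sketch of how such numbers could be computed from embeddings in the shared space, the snippet below assumes a simplified one-to-one pairing (text i matches image i), cosine similarity as the ranking score, and the usual definition of relative improvement, 100 * (new - base) / base. These are assumptions for illustration, not details confirmed by the source, and the function names are hypothetical.

```python
import numpy as np


def recall_at_k(text_emb, image_emb, k=1):
    """Text-to-image Recall@K in percent, assuming text i is paired with image i."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T                     # cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]          # k best images per text query
    targets = np.arange(len(text_emb))[:, None]
    hits = (top_k == targets).any(axis=1)
    return 100.0 * hits.mean()


def relative_improvement(r_new, r_base):
    """Relative change in percent of r_new over r_base (assumed definition)."""
    return 100.0 * (r_new - r_base) / r_base


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(100, 32))
    texts = images + 0.5 * rng.normal(size=(100, 32))  # noisy copies as toy "captions"
    print(recall_at_k(texts, images, k=1))
```

Benchmarks such as Flickr30K pair several captions with each image, so a faithful evaluation would map each caption to its image index rather than rely on the diagonal pairing used here.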