Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities

Adriel Saporta, Aahlad Puli, Mark Goldstein, Rajesh Ranganath

TL;DR

Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. The objective is developed from a lower bound on total correlation, and the authors show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities.

Abstract

Contrastive learning methods, such as CLIP, leverage naturally paired data, for example images and their corresponding text captions, to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments including on an original multilingual dataset of 33M image, text and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available at https://github.com/rajesh-lab/symile.
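
To make concrete the claim that pairwise objectives can miss joint information, the following is a small illustrative sketch, not taken from the paper. It assumes a toy distribution in which two fair bits `a` and `b` determine a third bit `c = a XOR b`, and uses the standard definition of total correlation (sum of marginal entropies minus joint entropy). All function names are hypothetical helpers introduced only for this example.

```python
from itertools import product
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axes):
    """Marginalize a joint distribution onto the given coordinate indices."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def mutual_information(joint, i, j):
    """I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j)."""
    return (entropy(marginal(joint, (i,)))
            + entropy(marginal(joint, (j,)))
            - entropy(marginal(joint, (i, j))))


def total_correlation(joint):
    """TC = sum_k H(X_k) - H(X_1, ..., X_n)."""
    n = len(next(iter(joint)))
    return sum(entropy(marginal(joint, (k,))) for k in range(n)) - entropy(joint)


# Assumed toy data: a and b are independent fair bits, c = a XOR b.
xor_joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

pairwise = [mutual_information(xor_joint, i, j) for i, j in ((0, 1), (0, 2), (1, 2))]
tc = total_correlation(xor_joint)

print("pairwise mutual information (bits):", pairwise)
print("total correlation (bits):", tc)
```

In this toy case every pair of variables is independent, so each pairwise mutual information is zero, while the total correlation is one bit. An objective that only sees pairs has nothing to learn from here, whereas one targeting total correlation does.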

Paper Structure

This paper contains 51 sections, 11 theorems, 88 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Given the distributions in eq:sampling_proc_i and eq:sampling_proc, for any value $i$ of $\mathbf{i}$ and any symile-clf $g$, a multi-sample contrastive lower bound on total correlation is

Figures (6)

  • Figure 1: An illustrative comparison of the information captured by CLIP (only pairwise) and Symile (both pairwise and higher-order).
  • Figure 2: Symile pre-training and zero-shot prediction on the Symile-M3 multilingual dataset. (a) Given a batch of triples, Symile maximizes the multilinear inner product (MIP) of positive triples (in yellow along the diagonal of the cube) and minimizes the MIP of negative triples. (b) The model selects the candidate image with the highest similarity to the query audio and text.
  • Figure 3: The performance gap between Symile and CLIP on binary synthetic data (left) is a consequence of the changing information dynamics between the variables as $\hat{p}$ moves from 0 to 1 (right). Mean accuracy is reported across 10 bootstrap samples of the test set.
  • Figure 4: (a) Data-generating process for Symile-M3-5. (b) Comparison of Symile and CLIP on the three versions of Symile-M3 ($w \in \{2, 5, 10\}$). Random chance is $1/1000$. Symile successfully leverages joint information between the modalities, whereas CLIP is limited to pairwise information, resulting in accuracies bounded by $1/w$. (c) Symile outperforms the CLIP baseline on Symile-M3-2 across varying levels of completeness in the training data. Both plots report mean accuracy across 10 bootstrap samples of the test set.
  • Figure 5: (a) Each sample of Symile-MIMIC includes an ECG and blood labs taken within 24 hours of the patient's admission to the hospital, and a CXR taken in the 24- to 72-hour period post-admission. (b) Retrieval accuracy for identifying the CXR corresponding to a given ECG and labs pair. Results are averaged over 10 bootstrap samples, with error bars indicating standard error.
  • ...and 1 more figure
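
The multilinear inner product and the zero-shot selection step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the MIP of three vectors is taken to be the sum over coordinates of their elementwise product, the MIP is assumed to serve directly as the similarity score for retrieval, and the embeddings are made-up toy values. The names `mip` and `zero_shot_retrieve` are hypothetical.

```python
def mip(x, y, z):
    """Multilinear inner product: sum over coordinates of x_i * y_i * z_i."""
    return sum(a * b * c for a, b, c in zip(x, y, z))


def zero_shot_retrieve(query_audio, query_text, candidate_images):
    """Score each candidate image against the (audio, text) query by MIP
    and return the index of the highest-scoring candidate with all scores."""
    scores = [mip(query_audio, query_text, image) for image in candidate_images]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (assumed values, for illustration only).
audio = [1.0, 0.0, 1.0]
text = [1.0, 1.0, 0.0]
images = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

best, scores = zero_shot_retrieve(audio, text, images)
print("scores:", scores)
print("selected candidate index:", best)
```

With two modalities the same expression reduces to the ordinary dot product used in pairwise contrastive scoring, which is one way to see the MIP as its extension to three or more modalities.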

Theorems & Definitions (18)

  • Theorem 3.1: Total Correlation Lower Bound
  • Lemma 3.2
  • Theorem 3.3: Symile Sufficient Statistics
  • Theorem 3.4: Conditional Distribution using the Scoring Function
  • Theorem B.1: Total Correlation Lower Bound
  • proof
  • Lemma C.1: Batch Sampling Procedure Properties
  • proof
  • Lemma D.1: Total Correlation for a Batch of Tuples
  • proof
  • ...and 8 more