The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities

Yongwei Che, Benjamin Eysenbach

TL;DR

The paper investigates how to reason across unpaired modalities by treating contrastive representations as density ratios and marginalizing over an intermediate modality $B$. It proves that under plausible assumptions, a direct similarity between encodings of $A$ and $C$ can reflect the true posterior ratio $\frac{p(C|A)}{p(C)}$, via a monotone transform, and extends to unnormalized Gaussian representations. When assumptions fail, it offers a Monte Carlo LogSumExp algorithm to approximate the ratio efficiently, enabling practical cross-modal bridging with pretrained models and in language-conditioned RL. The approach is validated on synthetic data, large pretrained multimodal models (e.g., CLIP/CLAP, LanguageBind), and language-conditioned navigation tasks, showing improved robustness to ambiguity and data scarcity and highlighting the method’s practical impact for zero-shot inference and modality fusion.
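The Monte Carlo LogSumExp idea can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes two critics already supply estimates of $\log \frac{p(b|a)}{p(b)}$ and $\log \frac{p(c|b)}{p(c)}$, and that the $M$ intermediate samples are drawn from $p(b)$. The function name `mc_log_ratio` and the array layout are hypothetical choices made for this example.

```python
import numpy as np


def mc_log_ratio(log_r_ab, log_r_bc):
    """Monte Carlo estimate of log p(c|a)/p(c) via an intermediate modality B.

    log_r_ab: array (N_a, M), entry [i, m] approximates log p(b_m|a_i)/p(b_m)
    log_r_bc: array (M, N_c), entry [m, j] approximates log p(c_j|b_m)/p(c_j)
    Assumption: the M intermediate samples b_m are drawn from p(b).
    """
    log_r_ab = np.asarray(log_r_ab, dtype=float)
    log_r_bc = np.asarray(log_r_bc, dtype=float)
    num_b = log_r_ab.shape[1]
    # Sum the two log-ratios for every (a_i, b_m, c_j) triple.
    joint = log_r_ab[:, :, None] + log_r_bc[None, :, :]  # (N_a, M, N_c)
    # Numerically stable LogSumExp over the intermediate axis.
    peak = joint.max(axis=1, keepdims=True)
    lse = peak[:, 0, :] + np.log(np.exp(joint - peak).sum(axis=1))
    # Subtract log M to turn the sum into a sample mean.
    return lse - np.log(num_b)


# Toy usage with made-up critic values (not taken from the paper).
scores = mc_log_ratio(np.log([[2.0, 2.0]]), np.zeros((2, 1)))
```

Subtracting the per-column maximum before exponentiating keeps the computation finite even when critic values are large, which is the usual reason to phrase the average as a LogSumExp.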

Abstract

While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inferences over modalities unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up being aligned. This theoretical paper proves that this hope is well founded, under certain assumptions. Starting with the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same inferences as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications.
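
The "proper Bayesian approach" of integrating out the intermediate modality can be written as a short identity. This is a sketch under the assumption that $A$ and $C$ are conditionally independent given $B$, so that $p(c \mid a, b) = p(c \mid b)$; the paper's own lemma statements, including their error constants, are not reproduced here.

$$
\begin{aligned}
\frac{p(c \mid a)}{p(c)}
&= \frac{1}{p(c)} \int p(c \mid b)\, p(b \mid a)\, db \\
&= \int p(b)\, \frac{p(b \mid a)}{p(b)}\, \frac{p(c \mid b)}{p(c)}\, db \\
&= \mathbb{E}_{b \sim p(b)}\!\left[ \frac{p(b \mid a)}{p(b)} \cdot \frac{p(c \mid b)}{p(c)} \right].
\end{aligned}
$$

Replacing the expectation with an average over $M$ samples $b_1, \dots, b_M \sim p(b)$ and taking logarithms gives a LogSumExp of the summed log-ratios minus $\log M$.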

Paper Structure

This paper contains 38 sections, 4 theorems, 18 equations, 13 figures, 2 tables.

Key Result

Lemma 1

Let $\phi_A(a), \phi_B(b), \phi_C(c)$ be three encoders trained with contrastive learning on paired data $p(a, b)$ and $p(b, c)$. Assume that the encoder pairs $(\phi_A(a), \phi_B(b))$ and $(\phi_B(b), \phi_C(c))$ satisfy Assumption \ref{assumption:prob}, and that our modalities $A, B, C$ satisfy Assumpti… where $K_1, K_2$ respectively denote the constant multiplicative errors of $(\phi_A, \phi_B), (\phi…

Figures (13)

  • Figure 1: Aligning vision and audio by integrating over the intermediate 'language' modality. While prior work has shown that aligning modalities $A \leftrightarrow B$ and $B \leftrightarrow C$ results in representations that can compare modalities A and C, it remains unclear when and why this is guaranteed to work. This paper provides the assumptions under which this approach is principled, and our analysis unlocks new ways of comparing unpaired modalities and new applications for contrastive learning.
  • Figure 2: Testing the "Law" of the Unconscious Contrastive Learner with three parametrizations of the critic function. We assess whether the "Law" holds by comparing the success of the "Direct" method to an oracle that is trained on $(A, C)$ examples. We also include the Monte Carlo method based on Lemma 1 to understand the assumptions underlying our analysis. (Left) For the L2 critic, all methods perform well, suggesting that all three assumptions are satisfied. (Center) For the dot product critic, the Direct method performs much worse than the Monte Carlo method, suggesting that Assumption \ref{assumption:dist} is violated; Fig. \ref{fig:repr_dist} confirms this. (Right) The performance of the Direct and Monte Carlo methods when using the normalized dot product suggests that the normalized dot product violates Assumption \ref{assumption:prob}, but that this assumption might not be necessary for the "Law" to hold.
  • Figure 3: A principled way of combining pre-trained models. Given pre-trained models that compute the similarities $A \leftrightarrow B$ and $B \leftrightarrow C$, we use Lemma 1 to infer the similarity $A \leftrightarrow C$.
  • Figure 4: (Left) Direct evaluation of the CLIP Image Encoder with the CLAP Audio Encoder for audio-visual inference versus using our LogSumExp algorithm with those same encoders. (Right) Direct evaluation with LanguageBind encoders versus our LogSumExp algorithm with those same encoders. In Appendix Fig. 5 we show that the 12% gap between Direct Evaluation and LogSumExp on LanguageBind is caused by using too few Monte Carlo samples; this gap shrinks to zero as the number of Monte Carlo samples is increased, in line with our theory.
  • Figure 5: The accuracy of our LogSumExp (Monte Carlo) approximation scales with the number of intermediate embeddings. As the number of sampled embeddings ($M$) increases, the Monte Carlo method converges to direct evaluation performance with both ImageBind (left) and LanguageBind (right). The shaded regions indicate 95% confidence intervals across multiple trials, and the dashed red lines represent the accuracy of direct computation. Recall@1 is evaluated from a set of 25 samples. These results validate our theoretical analysis and support the underlying assumptions of our approach.
  • ...and 8 more figures
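
The Figure 5 caption reports Recall@1 over a candidate set of 25 samples. A minimal sketch of that metric follows, assuming a square score matrix in which candidate $i$ is the correct match for query $i$; the paper's exact evaluation protocol is not given in this summary, and the function name `recall_at_1` is a hypothetical choice for this example.

```python
import numpy as np


def recall_at_1(scores):
    """Fraction of queries whose top-scoring candidate is the correct one.

    scores: (N, N) array; scores[i, j] is the similarity of query i to
    candidate j. Assumption: candidate i is the true match for query i.
    """
    scores = np.asarray(scores, dtype=float)
    predicted = scores.argmax(axis=1)  # best candidate per query
    targets = np.arange(scores.shape[0])
    return float((predicted == targets).mean())


# Toy usage: query 0 is ranked incorrectly, query 1 correctly.
example = recall_at_1([[0.0, 1.0], [0.0, 1.0]])
```

The score matrix could come from either direct comparison of embeddings or a Monte Carlo estimate; the metric itself does not depend on how the scores were produced.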

Theorems & Definitions (8)

  • Lemma 1
  • proof
  • Lemma 2: The "Law" of the Unconscious Contrastive Learner
  • proof
  • Lemma 3
  • Lemma 4
  • proof
  • proof