The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities
Yongwei Che, Benjamin Eysenbach
TL;DR
The paper studies how to reason across unpaired modalities $A$ and $C$ by treating contrastive representations as density ratios and marginalizing over an intermediate modality $B$. It proves that, under certain assumptions, the direct similarity between encodings of $A$ and $C$ recovers the posterior ratio $\frac{p(C \mid A)}{p(C)}$ up to a monotone transform, and it extends this result to unnormalized Gaussian representations. When the assumptions fail, a Monte Carlo LogSumExp algorithm approximates the ratio efficiently, which makes cross-modal bridging practical both with pretrained models and in language-conditioned RL. Experiments on synthetic data, large pretrained multimodal models (e.g., CLIP/CLAP, LanguageBind), and language-conditioned navigation tasks show improved robustness to ambiguity and data scarcity, with applications to zero-shot inference and modality fusion.
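The Monte Carlo LogSumExp idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each contrastive critic is an inner product of embeddings, that each critic approximates a log density ratio up to an additive constant, and that samples $b_i$ are drawn from the marginal $p(B)$. The function name `bridged_log_ratio` and all argument names are hypothetical.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) over a 1-D array."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def bridged_log_ratio(a_emb, b_embs_ab, b_embs_bc, c_emb):
    """Monte Carlo estimate of log p(c|a)/p(c), up to an additive constant.

    Assumed inputs (hypothetical names, not from the paper):
      a_emb:     (d1,)   embedding of a in the A-B contrastive space
      b_embs_ab: (N, d1) embeddings of N samples b_i ~ p(B) in the A-B space
      b_embs_bc: (N, d2) embeddings of the same b_i in the B-C space
      c_emb:     (d2,)   embedding of c in the B-C contrastive space
    """
    f_ab = b_embs_ab @ a_emb   # critic values f_AB(a, b_i), shape (N,)
    f_bc = b_embs_bc @ c_emb   # critic values f_BC(b_i, c), shape (N,)
    # log of the sample mean of exp(f_AB + f_BC), computed stably
    return logsumexp(f_ab + f_bc) - np.log(len(f_ab))

# Usage with random placeholder embeddings (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(size=8)
c = rng.normal(size=8)
b_ab = rng.normal(size=(256, 8))
b_bc = rng.normal(size=(256, 8))
score = bridged_log_ratio(a, b_ab, b_bc, c)
```

Working in log space avoids overflow when the critic values are large, which is the reason for using LogSumExp rather than averaging exponentials directly.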
Abstract
While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inferences over modalities unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up being aligned. This theoretical paper proves that this hope is well founded, under certain assumptions. Starting with the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same inferences as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications.
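The step of integrating out the intermediate modality can be written out as a short identity. The sketch below assumes that $A$ and $C$ are conditionally independent given $B$, so that $p(c \mid a, b) = p(c \mid b)$; this assumption and the lowercase notation are introduced here for illustration.

```latex
\begin{aligned}
\frac{p(c \mid a)}{p(c)}
  &= \int p(b \mid a)\, \frac{p(c \mid b)}{p(c)} \, db \\
  &= \mathbb{E}_{b \sim p(b)}\!\left[
       \frac{p(b \mid a)}{p(b)} \cdot \frac{p(c \mid b)}{p(c)}
     \right]
\end{aligned}
```

Both factors inside the expectation are density ratios of the kind that contrastive learning on the pairs $(A, B)$ and $(B, C)$ estimates, which is why the ratio for the unpaired modalities $(A, C)$ can be assembled from two separately trained models.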
