Group Contrastive Learning for Weakly Paired Multimodal Data

Aditya Gorla, Hugues Van Assel, Jan-Christian Huetter, Heming Yao, Kyunghyun Cho, Aviv Regev, Russell Littman

TL;DR

GROOVE addresses learning from weakly paired multimodal perturbation data by introducing GroupCLIP, a group-level supervised contrastive loss that uses shared perturbation labels to align modalities without instance-level pairs. It couples GroupCLIP with an on-the-fly backtranslating autoencoder, forming a shared latent space where cross-modal information is entangled, and it rigorously evaluates robustness via a combinatorial OT benchmarking framework that considers multiple aligners. Across simulations and real single-cell perturbation datasets, GROOVE achieves equal or superior cross-modal matching and imputation performance, with ablation studies showing GroupCLIP as the principal source of gains. The work highlights the value of group-level supervision for weakly paired multimodal learning and provides a versatile evaluation paradigm applicable to broader multi-omics integration challenges.

Abstract

We present GROOVE, a semi-supervised multi-modal representation learning approach for high-content perturbation data where samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges the gap between CLIP for paired cross-modal data and SupCon for uni-modal supervised contrastive learning, addressing a fundamental gap in contrastive learning for weakly paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework to encourage cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners across multiple optimal transport aligners, addressing key limitations in existing evaluation strategies. This framework includes novel simulations that systematically vary shared versus modality-specific perturbation effects, enabling principled assessment of method robustness. Our combinatorial benchmarking reveals that no aligner yet uniformly dominates across settings or modality pairs. Across simulations and two real single-cell genetic perturbation datasets, GROOVE performs on par with or outperforms existing approaches for downstream cross-modal matching and imputation tasks. Our ablation studies demonstrate that GroupCLIP is the key component driving performance gains. These results highlight the importance of leveraging group-level constraints for effective multi-modal representation learning in scenarios where only weak pairing is available.
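
To make the idea of a group-level contrastive loss concrete, the following is a minimal NumPy sketch, not the paper's exact GroupCLIP formulation (which is not given in this summary). It assumes a SupCon-style objective applied across modalities: for each anchor in one modality, every sample in the other modality that carries the same perturbation label counts as a positive. The function names, the symmetric averaging, and the default temperature `tau=0.1` are illustrative assumptions.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def _one_direction(z_anchor, z_other, y_anchor, y_other, tau):
    # Similarity of every anchor to every sample in the other modality.
    logits = z_anchor @ z_other.T / tau
    log_prob = _log_softmax(logits, axis=1)
    # Positives: samples in the other modality sharing the anchor's group label.
    pos = y_anchor[:, None] == y_other[None, :]
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0  # anchors whose group is absent on the other side are skipped
    if not valid.any():
        return 0.0
    pos_log_prob = (log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return float(-pos_log_prob.mean())


def group_contrastive_loss(z_a, z_b, y_a, y_b, tau=0.1):
    """Hypothetical group-level cross-modal contrastive loss (illustrative only)."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    y_a = np.asarray(y_a)
    y_b = np.asarray(y_b)
    # Cosine similarity via L2-normalised embeddings.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    # Average the A-to-B and B-to-A directions, as CLIP does for paired data.
    return 0.5 * (
        _one_direction(z_a, z_b, y_a, y_b, tau)
        + _one_direction(z_b, z_a, y_b, y_a, tau)
    )


# Toy example: three samples per modality, two perturbation groups.
z_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = group_contrastive_loss(z_a, z_b, [0, 0, 1], [0, 1, 1])
mismatched = group_contrastive_loss(z_a, z_b, [0, 0, 1], [1, 0, 0])
print(aligned < mismatched)
```

In this sketch, if every sample has a unique label and the two modalities are listed in the same order, each anchor has exactly one positive and the objective reduces to a CLIP-style symmetric loss. That is one way to read the abstract's claim that GroupCLIP sits between CLIP and SupCon.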

Paper Structure (37 sections, 40 equations, 2 figures, 12 tables, 1 algorithm)

Figures (2)

  • Figure 1: (a) In the broader context of contrastive learning, GroupCLIP can be viewed as the multi-modal generalization of SupCon. (b) GROOVE architecture and training step. Each iteration consists of two steps: (1) optimize the reconstruction and GroupCLIP losses, then (2) generate cross-modal pseudo-samples in inference mode and optimize the backtranslation loss.
  • Figure 2: Hyperparameter sensitivity landscape for matching performance. Contour plots show average performance across the 100%, 80%, and 50% shared variation settings for each combination of $\beta$ (x-axis) and $\tau$ (y-axis), profiled using Optuna-based hyperparameter search. All analyses in this work fix $\alpha=1$. (a) Trace-based matching performance (higher is better). (b) Barycentric FOSCTTM (lower is better).
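
Panel (b) of Figure 2 reports FOSCTTM (fraction of samples closer than the true match), where lower is better. As a rough illustration of the basic metric only, the sketch below assumes that row `i` of each embedding matrix is a known true pair and uses squared Euclidean distance. The barycentric variant named in the caption is not reproduced here, and normalization conventions (for example dividing by `n - 1` rather than `n`) vary between implementations.

```python
import numpy as np


def foscttm(z_a, z_b):
    """Plain FOSCTTM, assuming row i of z_a and row i of z_b are a true pair."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, n).
    d = ((z_a[:, None, :] - z_b[None, :, :]) ** 2).sum(axis=-1)
    true = np.diag(d)  # distance from each sample to its true match
    # For each sample in A: fraction of B samples strictly closer than its match.
    frac_a = (d < true[:, None]).mean(axis=1)
    # Same for each sample in B, looking at A.
    frac_b = (d < true[None, :]).mean(axis=0)
    return 0.5 * (frac_a.mean() + frac_b.mean())


# Three points on a line: perfect alignment versus a reversed ordering.
z = np.array([[0.0], [1.0], [2.0]])
perfect = foscttm(z, z)
reversed_order = foscttm(z, z[::-1])
print(perfect, reversed_order)
```

A score of 0 means every sample's true match is also its nearest neighbour in the other modality, while a random embedding would score around 0.5.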