IndiSeek learns information-guided disentangled representations

Yu Gui, Cong Ma, Zongming Ma

TL;DR

IndiSeek learns disentangled, information-preserving representations from multi-modal data in two steps: it first extracts shared cross-modal features with CLIP, then learns modality-specific features that are independent of those shared features, guided by a reconstruction-based bound on mutual information. The method uses an NCE-CLUB upper-bound term for disentanglement and a reconstruction-based surrogate for completeness, which lets it extract modality-specific signals even under nonlinear dependencies and redundant shared information. Experiments on synthetic simulations, a CITE-seq dataset, and diverse MultiBench benchmarks show that IndiSeek outperforms state-of-the-art disentanglement baselines and improves downstream task performance while remaining computationally efficient. The work also outlines task-related extensions and practical guidance for parameter tuning.

Abstract

Learning disentangled representations is a fundamental task in multi-modal learning. In modern applications such as single-cell multi-omics, both shared and modality-specific features are critical for characterizing cell states and supporting downstream analyses. Ideally, modality-specific features should be independent of shared ones while also capturing all complementary information within each modality. This tradeoff is naturally expressed through information-theoretic criteria, but mutual-information-based objectives are difficult to estimate reliably, and their variational surrogates often underperform in practice. In this paper, we introduce IndiSeek, a novel disentangled representation learning approach that addresses this challenge by combining an independence-enforcing objective with a computationally efficient reconstruction loss that bounds conditional mutual information. This formulation explicitly balances independence and completeness, enabling principled extraction of modality-specific features. We demonstrate the effectiveness of IndiSeek on synthetic simulations, a CITE-seq dataset and multiple real-world multi-modal benchmarks.
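
As a rough illustration of the tradeoff described above, the sketch below combines an independence penalty with a reconstruction penalty. It is not the authors' implementation: the independence term is a plain CLUB-style estimate with a unit-variance Gaussian q(s | z), swapped in for the paper's NCE-CLUB term, and the names `club_estimate`, `reconstruction_loss`, `combined_objective`, `mean_fn`, `decode_fn`, and the weight `lam` are all hypothetical. The reconstruction term relies on the standard fact that, for a unit-variance Gaussian decoder, squared error equals the negative log-likelihood up to a constant, and the expected negative log-likelihood upper-bounds the conditional entropy of x given (z, s).

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def club_estimate(z: Sequence[Vector], s: Sequence[Vector],
                  mean_fn: Callable[[Vector], Vector]) -> float:
    """CLUB-style estimate of dependence between z and s (assumed form).

    Uses q(s | z) = N(mean_fn(z), I); the Gaussian normalising constant
    cancels in the difference between paired and unpaired terms.
    """
    n = len(z)
    mu = [mean_fn(zi) for zi in z]
    # Average log q(s_i | z_i) over matched pairs.
    paired = sum(-0.5 * sq_dist(s[i], mu[i]) for i in range(n)) / n
    # Average log q(s_j | z_i) over all pairs (product of marginals).
    unpaired = sum(-0.5 * sq_dist(s[j], mu[i])
                   for i in range(n) for j in range(n)) / (n * n)
    return paired - unpaired


def reconstruction_loss(x: Sequence[Vector], x_hat: Sequence[Vector]) -> float:
    """Mean squared reconstruction error per sample."""
    return sum(sq_dist(a, b) for a, b in zip(x, x_hat)) / len(x)


def combined_objective(x, z, s, mean_fn, decode_fn, lam: float = 10.0) -> float:
    """Independence penalty plus lam times the reconstruction penalty.

    `lam` is a hypothetical tradeoff weight introduced for this sketch.
    """
    x_hat = [decode_fn(zi, si) for zi, si in zip(z, s)]
    return club_estimate(z, s, mean_fn) + lam * reconstruction_loss(x, x_hat)
```

In this toy form, a modality-specific feature `z` that predicts the shared feature `s` well yields a positive independence penalty, while a decoder that cannot rebuild `x` from `(z, s)` raises the reconstruction penalty. Minimising the sum pushes toward features that are both independent of the shared part and complete.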

Paper Structure

This paper contains 33 sections, 11 equations, 25 figures, 11 tables.

Figures (25)

  • Figure 1: Importance of learned modality-specific features: Setting 1.
  • Figure 2: Importance of learned modality-specific features: Setting 2.
  • Figure 3: IndiSeek: Information-guided Disentangled Representation Seeking.
  • Figure 4: Performance of IndiSeek on the CITE-seq dataset ($\lambda=10.0$).
  • Figure 5: Comparison of rank correlation metrics across three methods on the CITE-seq dataset.
  • ...and 20 more figures