Multi-View Causal Representation Learning with Partial Observability

Dingling Yao, Danru Xu, Sébastien Lachapelle, Sara Magliacane, Perouz Taslakian, Georg Martius, Julius von Kügelgen, Francesco Locatello

TL;DR

This work introduces a general framework for recovering latent content blocks from multiple partially observed views, under nonlinear mixtures and possible causal relationships. Central to the approach are content encoders that align shared content across views while enforcing invertibility through entropy regularization and projection mechanisms, enabling identifiability up to smooth bijections. The authors establish theoretical results (including an identifiability algebra) showing when and how content blocks can be recovered from various subsets of views, and demonstrate broad applicability by unifying prior nonlinear ICA, disentanglement, and causal representation learning results. Empirically, they validate the theory across synthetic and real-world multimodal datasets, showing that multiple blocks of latent content can be learned simultaneously and that prior methods emerge as special cases of the proposed framework. The work highlights the practical potential of leveraging multiple partial views to obtain finer-grained representations, while noting challenges such as non-convex optimization and finite-sample limitations, and pointing to future directions including interventions and causal marginal analysis.

Abstract

We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.
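The phrase "information shared across all subsets of any number of views" can be made concrete with plain set intersections. The sketch below is illustrative only: the latent index sets in `VIEW_LATENTS` and the helper names are hypothetical, not taken from the paper. For every subset of two or more views it computes which latent indices all of those views observe, which is the candidate content block for that subset.

```python
from itertools import combinations

# Hypothetical latent index sets S_k for four views (illustrative only).
VIEW_LATENTS = {
    1: {1, 2, 3, 4, 5},
    2: {1, 2, 3, 5, 6},
    3: {1, 2, 3, 4, 6},
    4: {1, 2, 4, 5, 6},
}


def shared_content(view_latents, views):
    """Indices of latents observed by every view in `views`."""
    return set.intersection(*(view_latents[k] for k in views))


def all_shared_blocks(view_latents):
    """Map each subset of two or more views to its shared content block."""
    ids = sorted(view_latents)
    return {
        subset: shared_content(view_latents, subset)
        for size in range(2, len(ids) + 1)
        for subset in combinations(ids, size)
    }


blocks = all_shared_blocks(VIEW_LATENTS)
for subset, block in blocks.items():
    print(subset, sorted(block))
```

Under these assumed index sets, larger subsets of views share smaller blocks, which is one way to read the abstract's claim that more partial views yield a finer-grained representation.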

Paper Structure (31 sections, 17 theorems, 53 equations, 7 figures, 10 tables)

Key Result

Theorem 3.2

Consider a set of views $\mathbf{x}_{V}$ satisfying the paper's general assumptions, and let $G$ be a set of content encoders (Definition 3.1) that minimizes the theorem's objective, in which the expectation is taken w.r.t. $p(\mathbf{x}_V)$ and $H(\cdot)$ denotes differential entropy. Then the shared content variable $\mathbf{z}_{C} := \{\mathbf{z}_j : j \in C\}$ is block-identified (Definition 2.3) ...
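
The objective itself is not reproduced in this excerpt. As a sketch consistent with the description in the TL;DR (aligning shared content across views, with entropy regularization to enforce invertibility), one plausible form is given below. The pairing over views, the squared Euclidean norm, and the unit weighting are assumptions, not the authors' statement; $g_k \in G$ denotes the content encoder for view $k$.

$$
\mathcal{L}(G) \;=\; \sum_{\substack{k, k' \in V \\ k < k'}} \mathbb{E}\left[ \left\lVert g_k(\mathbf{x}_k) - g_{k'}(\mathbf{x}_{k'}) \right\rVert_2^2 \right] \;-\; \sum_{k \in V} H\left( g_k(\mathbf{x}_k) \right)
$$

In this form, minimizing the first sum pulls the encoders' outputs together across views, while the negated entropy terms reward spread-out encodings and so discourage collapsed, non-invertible solutions.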

Figures (7)

  • Figure 1: Multi-View Setting with Partial Observability, for Example 2.1 with $K{=}4$ views and $N{=}6$ latents. Each view $\mathbf{x}_k$ is generated by a subset $\mathbf{z}_{S_k}$ of the latent variables through a view-specific mixing function $f_k$. Directed arrows between latents indicate causal relations.
  • Figure 2: Theory Validation: Average $R^2$ across multiple views generated from independent latents.
  • Figure 3: Simultaneous Multi-Content Identification using View-Specific Encoders. Experimental results on Multimodal3DIdent. Left: Image latents (averaged between two image views). Right: Text latents.
  • Figure 4: Theory Verification: Average $R^2$ across multiple views generated from causally dependent latents.
  • Figure 5: Causal3DIdent: Underlying causal relations and input examples.
  • ...and 2 more figures
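
The setting in Figure 1 (each view $\mathbf{x}_k = f_k(\mathbf{z}_{S_k})$ for a subset $S_k$ of the latents) can be mimicked numerically. The following is a minimal sketch under stated assumptions: the subsets in `SUBSETS`, the toy dependence between the first two latents, and the leaky-ReLU mixing functions are all hypothetical choices made for illustration, not the paper's data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LATENTS, N_SAMPLES = 6, 1000

# Hypothetical subsets S_k (0-based latent indices) for K = 4 views.
SUBSETS = [[0, 1, 2, 3, 4], [0, 1, 2, 4, 5], [0, 1, 2, 3, 5], [0, 1, 3, 4, 5]]


def leaky_relu(x, slope=0.2):
    """Elementwise bijective nonlinearity."""
    return np.where(x > 0, x, slope * x)


def make_mixing(dim, rng):
    """Build an invertible nonlinear map: orthogonal -> leaky ReLU -> orthogonal."""
    q1, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    q2, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return lambda z: leaky_relu(z @ q1) @ q2


# Sample latents, then add a toy dependence of latent 1 on latent 0 (assumed).
z = rng.normal(size=(N_SAMPLES, N_LATENTS))
z[:, 1] += 0.5 * z[:, 0]

# Each view mixes only its own subset of latents with its own function f_k.
mixers = [make_mixing(len(s), rng) for s in SUBSETS]
views = [f(z[:, s]) for f, s in zip(mixers, SUBSETS)]
```

Each entry of `views` is an array of shape `(N_SAMPLES, len(S_k))`. Composing orthogonal maps with a leaky ReLU keeps every mixing function invertible, which matches the role of $f_k$ as a mixing of the observed latents, though any other invertible nonlinear map would serve equally well here.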

Theorems & Definitions (27)

  • Example 2.1
  • Remark 2.2: "Content-Style" Terminology
  • Definition 2.3: Block-Identifiability
  • Definition 3.1: Content Encoders
  • Theorem 3.2: Identifiability from a Set of Views
  • Definition 3.3: View-Specific Encoders
  • Definition 3.4: Selection
  • Definition 3.5: Content Selectors
  • Definition 3.6: Projections
  • Definition 3.7: Information-Sharing Regularizer
  • ...and 17 more