Toward the Identifiability of Comparative Deep Generative Models

Romain Lopez, Jan-Christian Huetter, Ehsan Hajiramezanali, Jonathan Pritchard, Aviv Regev

TL;DR

This work develops a theory of identifiability for comparative deep generative models (DGMs), showing that when the mixing function is piecewise affine (as with ReLU networks), the latent subspaces governing target and background data become identifiable. It extends nonlinear ICA results to count data, establishing block-wise identifiability under Poisson/negative-binomial noise and clarifying limitations under Bernoulli noise. To operationalize the theory, the authors introduce the MO-cVAE and CO-cVAE training paradigms, which respectively optimize across multiple data sets and enforce interpretable latent independence via constrained optimization. Empirical validation on synthetic data confirms identifiability and disentanglement under correct latent-dimension specification and demonstrates the mitigating effect of multi-objective and constrained regularization; a single-cell perturbation data set further shows practical gains in recovering salient patterns while controlling leakage into the background. Overall, the paper provides a principled route to robust, interpretable comparison across heterogeneous data sources using DGMs, with broad implications for scientific analyses that rely on modular latent representations.

Abstract

Deep Generative Models (DGMs) are versatile tools for learning data representations while adequately incorporating domain knowledge such as the specification of conditional probability distributions. Recently proposed DGMs tackle the important task of comparing data sets from different sources. One such example is the setting of contrastive analysis that focuses on describing patterns that are enriched in a target data set compared to a background data set. The practical deployment of those models often assumes that DGMs naturally infer interpretable and modular latent representations, which is known to be an issue in practice. Consequently, existing methods often rely on ad-hoc regularization schemes, although without any theoretical grounding. Here, we propose a theory of identifiability for comparative DGMs by extending recent advances in the field of non-linear independent component analysis. We show that, while these models lack identifiability across a general class of mixing functions, they surprisingly become identifiable when the mixing function is piece-wise affine (e.g., parameterized by a ReLU neural network). We also investigate the impact of model misspecification, and empirically show that previously proposed regularization techniques for fitting comparative DGMs help with identifiability when the number of latent variables is not known in advance. Finally, we introduce a novel methodology for fitting comparative DGMs that improves the treatment of multiple data sources via multi-objective optimization and that helps adjust the hyperparameter for the regularization in an interpretable manner, using constrained optimization. We empirically validate our theory and new methodology using simulated data as well as a recent data set of genetic perturbations in cells profiled via single-cell RNA sequencing.
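
As a rough illustration of the contrastive setting described above, the sketch below simulates a target and a background data set from a shared piecewise affine mixing function, with target samples generated as f(z, s) and background samples as f(z, 0). It is not the authors' implementation. The layer widths, the leaky-ReLU slope, the Gaussian latent distributions, the exponential link to a Poisson rate, and all function names are assumptions made for this example only.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # Elementwise piecewise affine and invertible activation (slope is an assumption).
    return np.where(x > 0, x, slope * x)


def make_mixing(rng, d_z, d_s, d_hidden, d_x):
    # Hypothetical two-layer network. Widths are non-decreasing
    # (d_z + d_s <= d_hidden <= d_x), so with Gaussian weights each linear map
    # has full column rank almost surely and the composite map is injective.
    d_latent = d_z + d_s
    w1 = rng.normal(size=(d_latent, d_hidden)) / np.sqrt(d_latent)
    w2 = rng.normal(size=(d_hidden, d_x)) / np.sqrt(d_hidden)

    def f(z, s):
        latent = np.concatenate([z, s], axis=1)
        return leaky_relu(latent @ w1) @ w2

    return f


def simulate(n_background=500, n_target=500, d_z=2, d_s=2, d_hidden=8, d_x=20, seed=0):
    rng = np.random.default_rng(seed)
    f = make_mixing(rng, d_z, d_s, d_hidden, d_x)

    # Background samples: salient variables are fixed to zero, i.e. f(z, 0).
    z_b = rng.normal(size=(n_background, d_z))
    s_b = np.zeros((n_background, d_s))

    # Target samples: both background and salient variables are active, i.e. f(z, s).
    z_t = rng.normal(size=(n_target, d_z))
    s_t = rng.normal(size=(n_target, d_s))

    # Assumed count-noise model: Poisson with an exponential link.
    # The log-rate is clipped only to keep the simulated counts moderate.
    rate_b = np.exp(np.clip(f(z_b, s_b), -10.0, 5.0))
    rate_t = np.exp(np.clip(f(z_t, s_t), -10.0, 5.0))

    return {
        "x_background": rng.poisson(rate_b),
        "x_target": rng.poisson(rate_t),
        "z_background": z_b,
        "s_background": s_b,
        "z_target": z_t,
        "s_target": s_t,
    }


if __name__ == "__main__":
    data = simulate()
    print(data["x_background"].shape, data["x_target"].shape)
```

In this toy setup, variation in the target counts that cannot be explained by f(z, 0) is attributable to the salient variables s, which is the structure a comparative DGM tries to recover.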

Paper Structure

This paper contains 77 sections, 13 theorems, 95 equations, 2 figures, 11 tables.

Key Result

Theorem 1

Let the ground truth mixing function $f$ and the learned mixing function $\tilde{f}$ both be continuous and injective piece-wise affine mixing functions such that $f(\bm{z}, \bm{s}) {\,{\buildrel d \over =}\,} \tilde{f}(\bm{z}, \bm{s})$ and $f(\bm{z}, \bm{0}) {\,{\buildrel d \over =}\,} \tilde{f}(\b

Figures (2)

  • Figure 1: Presentation of the comparative deep generative models considered in this work.
  • Figure 2: UMAP visualization of salient and background spaces from SO-cVAE, MO-CO-cVAE, as well as ContrastiveVI. Each point is a cell. Cells are colored by their group of genetic perturbation, where groups were assigned based on biological annotation from the authors of Norman et al. (2019).

Theorems & Definitions (19)

  • Definition 1: Compatible map
  • Definition 2: Subspace Disentanglement
  • Definition 3: Subspace Identifiability
  • Example 1: Counterexample
  • Theorem 1: Identifiability Theorem
  • Theorem 2: Reduction from observational count noise to the noiseless setting
  • Proposition 1: Block-wise identifiability under misspecification for the linear case
  • Lemma 1
  • Theorem 4: Theorem D.4 from Kivva et al. (2022)
  • Theorem 2: Identifiability Theorem
  • ...and 9 more