Identifiability of a statistical model with two latent vectors: Importance of the dimensionality relation and application to graph embedding

Hiroaki Sasaki

TL;DR

The paper advances identifiability in unsupervised representation learning by proposing a two-latent-vector model with a single auxiliary datum, allowing arbitrary latent dimensions and deriving dimensionality-based identifiability conditions. It establishes partial and full identifiability results, including a scenario where nonlinear indeterminacies are removed so that $\bm{g}(\bm{x})$ recovers $\bm{s}$ up to permutation and scaling, and it frames a reverse-generative perspective. The theory is applied to graph embedding, yielding a practical identifiable embedding method, Graph Component Analysis (GCA), based on density-ratio estimation, with identifiability for graphs depending on the maximum link weight $K$ relative to the latent dimension $d_{\mathrm{s}}$. Empirical results on artificial data corroborate the theory, showing that identifiability improves when $K\ge d_{\mathrm{s}}$ and that recovery of latent components is feasible when $d_{\mathrm{s}}\le d_{\mathrm{x}}$. Overall, the work generalizes nonlinear ICA with auxiliary data to flexible two-latent-vector models and provides a concrete, identifiable graph embedding approach with theoretical guarantees.

Abstract

Identifiability of statistical models is a key notion in unsupervised representation learning. Recent work on nonlinear independent component analysis (ICA) employs auxiliary data and has established identifiability conditions. This paper proposes a statistical model of two latent vectors with single auxiliary data, which generalizes nonlinear ICA, and establishes various identifiability conditions. Unlike previous work, the two latent vectors in the proposed model can have arbitrary dimensions, and this property enables us to reveal an insightful dimensionality relation among the two latent vectors and the auxiliary data in the identifiability conditions. Furthermore, surprisingly, we prove that the indeterminacies of the proposed model are the same as those of linear ICA under certain conditions: the elements of the latent vector can be recovered up to their permutation and scales. Next, we apply the identifiability theory to a statistical model for graph data. As a result, one of the identifiability conditions has an appealing implication: identifiability of the statistical model could depend on the maximum value of the link weights in graph data. Then, we propose a practical method for identifiable graph embedding. Finally, we numerically demonstrate that the proposed method recovers the latent vectors well and that model identifiability clearly depends on the maximum value of the link weights, which supports the implication of our theoretical results.

Paper Structure

This paper contains 22 sections, 7 theorems, 57 equations, and 2 figures.

Key Result

Theorem 1

Suppose that the following assumptions hold: Then, $p_{\mathrm{xu}|\mathrm{w}}^{\bm{g},\bm{\phi}}$ is partially identifiable with respect to $\bm{g}$ up to a permutation of the elements in $\bm{g}$ and elementwise nonlinear functions.
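
One plausible way to write out the conclusion of Theorem 1 is sketched below. The second parameter pair $(\tilde{\bm{g}}, \tilde{\bm{\phi}})$, the permutation $\pi$, the scalar functions $h_i$, and the scales $a_i$ are notation introduced here for illustration only; they are not taken from the paper. The first line expresses an indeterminacy of permutation plus elementwise nonlinear functions, and the second line shows the narrower linear-ICA-type indeterminacy (permutation and scales) that the abstract says holds under certain conditions.

```latex
% Hypothetical notation: (\tilde{\bm{g}}, \tilde{\bm{\phi}}) is a second parameter pair,
% \pi is a permutation of the component indices, h_i is a scalar nonlinear
% function (assumed invertible here), and a_i is a nonzero scalar.
\begin{align*}
p_{\mathrm{xu}|\mathrm{w}}^{\bm{g},\bm{\phi}}
  = p_{\mathrm{xu}|\mathrm{w}}^{\tilde{\bm{g}},\tilde{\bm{\phi}}}
\;\Longrightarrow\;
\tilde{g}_i(\bm{x}) &= h_i\bigl(g_{\pi(i)}(\bm{x})\bigr)
  \quad \text{for every component } i, \\
\text{linear-ICA-type case:}\quad
\tilde{g}_i(\bm{x}) &= a_i\, g_{\pi(i)}(\bm{x}), \qquad a_i \neq 0 .
\end{align*}
```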

Figures (2)

  • Figure 1: Mean absolute correlation over the latent dimension $d_{\mathrm{s}}$ when $(d_{\mathrm{x}}, K, n)=(6, 10, 10000)$. The red and blue lines are for GCA and EBM, respectively. Each marker represents the average of the mean absolute correlations over $10$ runs, and the shaded region is the standard deviation.
  • Figure 2: Mean absolute correlation over the number of link states when $d_{\mathrm{s}}=d_{\mathrm{x}}$ and $(d_{\mathrm{x}}, n)=(6, 10000)$. The red and blue lines are for GCA and EBM, respectively. For EBM, the blue line is flat because link weights are not required and the maximum link state is fixed at $K=10$ in learning EBM.
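
The mean absolute correlation reported in both figures can be computed in several ways; the captions do not say how components are matched. The sketch below shows one common convention from the ICA literature, assuming the true and estimated components are paired one-to-one by the Hungarian algorithm before averaging. The function name `mean_abs_correlation` and the synthetic Laplace data are hypothetical; only the sizes ($n = 10000$, dimension 6) mirror the figure captions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mean_abs_correlation(s_true, s_est):
    """Mean absolute Pearson correlation after one-to-one component matching.

    Assumption: components are matched by maximising the total absolute
    correlation; the paper's exact matching rule is not stated here.
    """
    d_true = s_true.shape[1]
    # Columns are variables; keep only the true-vs-estimated block.
    corr = np.corrcoef(s_true, s_est, rowvar=False)[:d_true, d_true:]
    abs_corr = np.abs(corr)
    # Hungarian algorithm: best one-to-one pairing of components.
    rows, cols = linear_sum_assignment(abs_corr, maximize=True)
    return abs_corr[rows, cols].mean()


rng = np.random.default_rng(0)
s = rng.laplace(size=(10000, 6))  # synthetic stand-in for latent components

# An estimate that differs from s only by permutation, sign, and scale.
perm = rng.permutation(6)
scales = rng.uniform(0.5, 2.0, size=6) * rng.choice([-1.0, 1.0], size=6)
s_hat = s[:, perm] * scales
score = mean_abs_correlation(s, s_hat)

# An unrelated estimate, for contrast.
s_noise = rng.normal(size=(10000, 6))
score_noise = mean_abs_correlation(s, s_noise)

print(score, score_noise)
```

Because Pearson correlation is invariant to nonzero scaling and the matching step absorbs permutations, an estimate that is correct up to permutation and scales scores close to 1, while an unrelated estimate scores close to 0. This is why the metric suits the indeterminacies discussed above.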

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proposition 4
  • Corollary 5
  • Lemma 6: Lemma 10 in sasaki2022representation
  • Lemma 7
  • proof
  • proof
  • proof
  • ...and 2 more