Identifiability of a statistical model with two latent vectors: Importance of the dimensionality relation and application to graph embedding
Hiroaki Sasaki
TL;DR
The paper advances identifiability in unsupervised representation learning by proposing a two-latent-vector model with a single auxiliary datum, allowing arbitrary latent dimensions and deriving dimensionality-based identifiability conditions. It establishes partial and full identifiability results, including a scenario where nonlinear indeterminacies are removed so $g(\boldsymbol{x})$ recovers $\boldsymbol{s}$ up to permutation and scaling, and it frames a reverse-generative perspective. The theory is applied to graph embedding, yielding a practical identifiable embedding method, Graph Component Analysis (GCA), based on density-ratio estimation, with identifiability for graphs depending on the maximum link weight $K$ relative to the latent dimension $d_{\text{s}}$. Empirical results on artificial data corroborate the theory, showing that identifiability improves when $K\ge d_{\text{s}}$ and that recovery of latent components is feasible when $d_{\text{s}}\le d_{\text{x}}$. Overall, the work generalizes nonlinear ICA with auxiliary data to flexible two-latent-vector models and provides a concrete, identifiable graph embedding approach with theoretical guarantees.
Abstract
Identifiability of statistical models is a key notion in unsupervised representation learning. Recent work on nonlinear independent component analysis (ICA) employs auxiliary data and has established identifiability conditions. This paper proposes a statistical model of two latent vectors with a single auxiliary datum, generalizing nonlinear ICA, and establishes various identifiability conditions. Unlike previous work, the two latent vectors in the proposed model can have arbitrary dimensions, and this property enables us to reveal an insightful dimensionality relation among the two latent vectors and the auxiliary data in the identifiability conditions. Furthermore, surprisingly, we prove that under certain conditions the indeterminacies of the proposed model are the same as those of \emph{linear} ICA: the elements of the latent vector can be recovered up to permutation and scaling. Next, we apply the identifiability theory to a statistical model for graph data. One of the resulting identifiability conditions has an appealing implication: identifiability of the statistical model can depend on the maximum value of link weights in the graph data. Then, we propose a practical method for identifiable graph embedding. Finally, we numerically demonstrate that the proposed method recovers the latent vectors well and that model identifiability clearly depends on the maximum value of link weights, which supports the implication of our theoretical results.
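To make "recovered up to permutation and scaling" concrete, the following is a minimal sketch of how such recovery is commonly checked in the ICA literature: pair each true component with an estimated one so that the mean absolute correlation is maximized. It is an illustration only and is not taken from the paper's experiments. The function names `pearson` and `matched_abs_correlation`, the toy data, and the brute-force search over permutations are all assumptions made here for clarity.

```python
import itertools
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def matched_abs_correlation(true_cols, est_cols):
    """Best mean |correlation| over all pairings of true and estimated components.

    Hypothetical helper: brute force over permutations, so it is only
    suitable for small latent dimensions.
    """
    d = len(true_cols)
    corr = [[abs(pearson(t, e)) for e in est_cols] for t in true_cols]
    best_score, best_perm = -1.0, None
    for perm in itertools.permutations(range(d)):
        score = sum(corr[i][perm[i]] for i in range(d)) / d
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


# Toy check (assumed data): three independent sources, and an "estimate"
# that is a permuted, rescaled, and sign-flipped copy of them.
rng = random.Random(0)
sources = [[rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(3)]
estimate = [
    [-2.0 * v for v in sources[2]],
    [0.5 * v for v in sources[0]],
    [3.0 * v for v in sources[1]],
]
score, perm = matched_abs_correlation(sources, estimate)
```

Because absolute correlation is invariant to nonzero rescaling and sign flips, and the search ranges over all orderings, a score close to 1 indicates that the estimate matches the sources up to exactly the indeterminacies of linear ICA. For larger dimensions, an assignment solver would be a more practical substitute for the brute-force loop.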
