General Identifiability and Achievability for Causal Representation Learning
Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Ali Tajer
TL;DR
This work extends causal representation learning (CRL) to a general nonparametric latent causal model whose variables are mapped to the observations by a diffeomorphic transformation. It shows that two hard interventions per latent node suffice for perfect identifiability of the latent DAG and the latent variables, even when the mapping from environments to intervened nodes is unknown (uncoupled interventions). It introduces GSCALE-I, a score-based algorithm that uses score variations across interventional environments to recover the inverse transformation and the latent variables, with provable guarantees in both the uncoupled and coupled settings. When observational data are available, these guarantees do not require faithfulness assumptions. The paper also proves identifiability under coupled interventions with weaker discrepancy requirements, and synthetic experiments confirm high recovery accuracy while highlighting the importance of accurate score estimation. Overall, the results give a constructive route to identifiability and latent-variable recovery in CRL under general nonparametric conditions, and lay a foundation for reducing the number of required interventions and improving score estimation on real data.
Abstract
This paper focuses on causal representation learning (CRL) under a general nonparametric latent causal model and a general transformation model that maps the latent data to the observational data. It establishes identifiability and achievability results using two hard uncoupled interventions per node in the latent causal graph. Notably, it is not known which pairs of interventional environments intervene on the same node (hence, uncoupled). For identifiability, the paper establishes that perfect recovery of the latent causal model and variables is guaranteed under uncoupled interventions. For achievability, an algorithm is designed that uses observational and interventional data and recovers the latent causal model and variables with provable guarantees. This algorithm leverages score variations across different environments to estimate the inverse of the transformation and, subsequently, the latent variables. The analysis additionally recovers the identifiability result for two hard coupled interventions, that is, when metadata specifying which pairs of environments intervene on the same node are known. This paper also shows that when observational data are available, the additional faithfulness assumptions adopted in the existing literature are unnecessary.
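To give some intuition for why score variations across environments carry causal information, here is a minimal sketch. It is not the paper's GSCALE-I algorithm. It assumes a hypothetical three-node linear Gaussian latent chain (Z1 -> Z2 -> Z3), for which the score (the gradient of the log-density) has a closed form, and it works directly in the latent space. In the setting described above only the transformed observations are available, so this shows the underlying sparsity pattern rather than a recovery procedure. All names, weights, and variances below are illustrative assumptions.

```python
import numpy as np

def precision(A, noise_var):
    """Precision matrix of Z for the linear SEM Z = A Z + N, N ~ N(0, diag(noise_var)).
    A[i, j] is the weight of the edge j -> i (illustrative convention)."""
    B = np.eye(A.shape[0]) - A
    return B.T @ np.diag(1.0 / noise_var) @ B

def hard_intervention(A, noise_var, node, new_var):
    """Hard intervention: cut all incoming edges of `node` and replace its noise variance."""
    A_int, var_int = A.copy(), noise_var.copy()
    A_int[node, :] = 0.0
    var_int[node] = new_var
    return A_int, var_int

def score(theta, z):
    """Score of a zero-mean Gaussian with precision theta, for each row of z."""
    return -z @ theta  # theta is symmetric, so each row equals -(theta @ z_k)

def active_coords(score_a, score_b, tol=1e-9):
    """Coordinates in which two score functions differ on the sample points."""
    return np.flatnonzero(np.mean(np.abs(score_a - score_b), axis=0) > tol)

# Hypothetical latent chain Z1 -> Z2 -> Z3 (0-indexed nodes 0, 1, 2).
A = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, -1.5, 0.0]])
noise_var = np.array([1.0, 0.5, 2.0])

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 3))  # arbitrary points at which to compare scores

theta_obs = precision(A, noise_var)
# Two different hard interventions on the same node (index 1).
theta_int1 = precision(*hard_intervention(A, noise_var, node=1, new_var=1.0))
theta_int2 = precision(*hard_intervention(A, noise_var, node=1, new_var=3.0))

# Observational vs. interventional: the score changes at the node and its parent.
obs_vs_int = active_coords(score(theta_obs, z), score(theta_int1, z))
# Two hard interventions on the same node: the score changes at that node only.
int_vs_int = active_coords(score(theta_int1, z), score(theta_int2, z))

print(obs_vs_int)  # [0 1]
print(int_vs_int)  # [1]
```

In this toy model, comparing two hard interventions on the same node isolates a single latent coordinate, while comparing against the observational environment also exposes the node's parent. This gives a rough sense of how pairs of interventional environments and observational data can play different roles.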
