Order-based Structure Learning with Normalizing Flows

Hamidreza Kamkari, Vahid Balazadeh, Vahid Zehtab, Rahul G. Krishnan

TL;DR

The paper addresses causal structure learning from observational data by relaxing the common additive-noise model (ANM) assumption with autoregressive normalizing flows (ANFs) and by framing structure learning as a search over topological orderings. It introduces OSLow, which uses a masked-flow ensemble to model multiple orderings and a differentiable permutation-learning objective based on a Boltzmann distribution over permutation matrices, enabling gradient-based optimization over the discrete space of orderings. The authors prove identifiability results for the class of restricted location-scale noise models (LSNMs) with affine ANFs and report state-of-the-art performance on Sachs and SynTReN, along with accurate estimation of interventional distributions from observational data. The results suggest that relaxing the ANM assumption can yield practical gains in real-world causal discovery, with potential extensions to more general non-linear flow models.
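For orientation, the two noise-model families named above can be written side by side. This is a sketch in generic notation, not necessarily the paper's: the symbols $f_i$, $g_i$, $t_i$, $s_i$, $\epsilon_i$, $u_i$ and $\mathrm{PA}_i$ (the parents of $X_i$) are assumptions introduced here for illustration.

$$
\begin{aligned}
\text{ANM:}\quad & X_i = f_i(\mathrm{PA}_i) + \epsilon_i, \\
\text{LSNM:}\quad & X_i = f_i(\mathrm{PA}_i) + g_i(\mathrm{PA}_i)\,\epsilon_i, \qquad g_i > 0, \\
\text{Affine ANF under ordering } \pi\text{:}\quad & x_{\pi(i)} = t_i\big(x_{\pi(1)}, \ldots, x_{\pi(i-1)}\big) + s_i\big(x_{\pi(1)}, \ldots, x_{\pi(i-1)}\big)\,u_i, \qquad s_i > 0.
\end{aligned}
$$

Setting $g_i \equiv 1$ recovers the ANM, so the LSNM family strictly contains it. The affine flow has the same location-scale form, with the parents of each variable restricted to its predecessors in the ordering $\pi$.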

Abstract

Estimating the causal structure of observational data is a challenging combinatorial search problem that scales super-exponentially with graph size. Existing methods use continuous relaxations to make this problem computationally tractable but often restrict the data-generating process to additive noise models (ANMs) through explicit or implicit assumptions. We present Order-based Structure Learning with Normalizing Flows (OSLow), a framework that relaxes these assumptions using autoregressive normalizing flows. We leverage the insight that searching over topological orderings is a natural way to enforce acyclicity in structure discovery and propose a novel, differentiable permutation learning method to find such orderings. Through extensive experiments on synthetic and real-world data, we demonstrate that OSLow outperforms prior baselines and improves performance on the observational Sachs and SynTReN datasets as measured by structural Hamming distance and structural intervention distance, highlighting the importance of relaxing the ANM assumption made by existing methods.
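The claim that searching over topological orderings enforces acyclicity can be checked on a toy case. The sketch below is illustrative only and is not taken from the paper's code; the helper names `order_mask` and `is_acyclic` are hypothetical. It builds, for each ordering, the densest graph consistent with that ordering and verifies that it has no cycles.

```python
import itertools
import numpy as np

def order_mask(order):
    """Return a d x d 0/1 matrix M with M[i, j] = 1 iff variable i precedes
    variable j in `order`, i.e. the edge i -> j is allowed by the ordering."""
    d = len(order)
    position = np.empty(d, dtype=int)
    position[list(order)] = np.arange(d)  # position[v] = rank of variable v
    return (position[:, None] < position[None, :]).astype(int)

def is_acyclic(adj):
    """A directed graph on d nodes is acyclic iff its adjacency matrix is
    nilpotent, i.e. A^d is the zero matrix (no walks of length d exist)."""
    d = adj.shape[0]
    return not np.linalg.matrix_power(adj, d).any()

d = 3
orders = list(itertools.permutations(range(d)))  # all d! candidate orderings
masks = [order_mask(o) for o in orders]
all_acyclic = all(is_acyclic(m) for m in masks)
```

Every subgraph of such a mask is also acyclic, so any structure learned under a fixed ordering is a DAG by construction. The search space is then the $d!$ orderings rather than the set of all DAGs.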

Paper Structure

This paper contains 26 sections, 9 theorems, 39 equations, 5 figures, 5 tables, and 2 algorithms.

Key Result

Proposition 4.0

Consider an SCM $\mathcal{M}$ from the family of restricted LSNMs and denote by $\Theta$ the set of all "conventional" affine ANFs w.r.t. $\mathcal{M}$. For each permutation $\pi$, let $\ell^*_\infty(\pi)$ be the minimum expected negative log-likelihood under this model. Let $\Pi_\mathcal{G}$ denote the set of all valid causal orderings of $\mathcal{G}$. Then, $\forall \pi \in \Pi_{\mathc

Figures (5)

  • Figure 1: A summary of OSLow for a ($d$ = 3) dimensional structure learning example that is formulated as searching over the set of all possible permutation matrices $\{\mathbf{P}_i\}_{i=1}^{d!}$. Upper Right: How a permutation matrix $\mathbf{P}_i$ shapes dependencies in our ANF architecture, leading to the formulation of the negative log-likelihood $\ell^*_N(\mathbf{P}_i)$ corresponding to that ordering on the entire observational dataset of size $N$. Lower Left: A Boltzmann distribution $\alpha(\cdot)$ is defined over all the permutations and is parameterized by $\mathbf{\Gamma} \in \mathbb{R}^{d\times d}$, with the energy corresponding to $\mathbf{P}_i$ defined as the inner product of $\mathbf{\Gamma}$ and $\mathbf{P}_i$. We estimate the Boltzmann probability masses (detailed in the method) and set them to zero for permutations that are unlikely to be sampled. Hashed bars indicate the true probability mass $\alpha(\cdot)$, while solid bars show the estimated probability $\hat{\alpha}(\cdot)$. Lower Right: The combination of the estimated distribution and flow likelihoods results in an overall differentiable loss $\widehat{\mathcal{L}}$ for order learning. Upper Left: Visualizing $\widehat{\mathcal{L}}$ over the entire Birkhoff polytope. The blue hue indicates how close we are to the ordering that produces the maximum likelihood, with $\mathbf{P}_6$ being that ordering.
  • Figure 2: Estimating the interventional expected value $\mathbb{E}[X_5 | do(X_1)]$ using OSLow in a full causal graph from $X_1$ to $X_5$. The estimated value matches the true expectation within the $99\%$ confidence interval of the observational data.
  • Figure 3: Network architecture of OSLow for the special case of affine ANFs with block masking matrices. The causal ordering of variables is $\langle X_2, X_3, X_1 \rangle$. Each colour represents the neurons corresponding to one variable. In the final layer, self-connections are absent, and each neuron depends solely on inputs with smaller labels.
  • Figure 4: Estimating the interventional expected value $\mathbb{E}[X_2, \ldots, X_5 | do(X_1)]$ using OSLow in a tournament causal graph from $X_1$ to $X_5$. The estimated value matches the true expectation within the $99\%$ confidence interval of the observational data.
  • Figure 5: Estimation of interventional expected values in a causal path with $X_1$ and $X_3$ being the first and last nodes.
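
The Boltzmann distribution over permutation matrices described in the Figure 1 caption can be computed exactly for $d = 3$ by enumerating all $d!$ permutations. The sketch below rests on assumptions not confirmed by this page: it takes $\alpha(\mathbf{P}) \propto \exp(\langle \mathbf{\Gamma}, \mathbf{P} \rangle / \tau)$ with a hypothetical temperature $\tau$ (the sign convention may differ in the paper), it uses random placeholder numbers in place of the flow negative log-likelihoods $\ell^*_N(\mathbf{P}_i)$, and it does not reproduce the paper's estimator $\hat{\alpha}$.

```python
import itertools
import numpy as np

def all_permutation_matrices(d):
    """Stack all d! permutation matrices into an array of shape (d!, d, d)."""
    eye = np.eye(d)
    return np.stack([eye[list(p)] for p in itertools.permutations(range(d))])

def boltzmann_over_permutations(gamma, perms, temperature=1.0):
    """Exact probabilities alpha(P_k) proportional to exp(<Gamma, P_k> / temperature).
    The sign and the temperature are assumptions of this sketch."""
    energies = np.einsum("ij,kij->k", gamma, perms)  # Frobenius inner products
    logits = energies / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
d = 3
perms = all_permutation_matrices(d)
gamma = rng.normal(size=(d, d))               # learnable parameter Gamma
alpha = boltzmann_over_permutations(gamma, perms)

# Placeholder per-ordering losses; in the paper these come from the trained flows.
nll = rng.uniform(1.0, 2.0, size=len(perms))
expected_loss = float(alpha @ nll)            # sum_k alpha(P_k) * loss(P_k)
```

Because `expected_loss` is a smooth function of `gamma`, gradients with respect to the parameter are well defined even though the permutations themselves are discrete. Exact enumeration is only feasible for very small $d$, which is why the caption refers to estimated probability masses.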

Theorems & Definitions (20)

  • Proposition 4.0
  • Definition A.1: Causal Minimality
  • Definition A.2: Non-constant SCMs
  • Lemma A.1
  • proof
  • Corollary A.2
  • proof
  • Definition B.1
  • Definition B.2: Bivariate Identifiability
  • Lemma B.1: Proposition 29 of Peters et al. (2014)
  • ...and 10 more