Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling

Xudong Sun, Alex Markham, Pratik Misra, Carsten Marr

TL;DR

This work identifies and analyzes critical pitfalls in implicit unobserved confounding synthesis, notably the restricted spectrum from diagonally dominant constructions of the idiosyncratic covariance Ω and the limited bidirected-edge structures in ancestral ADMG generation. It then introduces an explicit, block-hierarchical confounding synthesis approach that generates a ground-truth DAG, hides selected variables, and converts the result into an ancestral graph for evaluation, thereby ensuring broader coverage of the causal-model space. The explicit formulation shows that Ω can be expressed as Ω = Λ E(ξ ξ^T) Λ^T + E(ε_O ε_O^T), which, with suitable constraints, spans the space of symmetric positive definite matrices and connects to the implicit parameterization, enabling robust comparisons of causal-discovery methods. The proposed protocol supports heterogeneous graph structures, scalable ancestral sampling (including Wishart-based weight sampling), and principled DAG-to-ancestral-graph transformation, improving realism and diversity in synthetic benchmarks and providing a principled bridge between implicit and explicit confounding parameterizations.
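The decomposition of $\Omega$ above can be checked numerically. The following is a minimal sketch, not the paper's protocol: the dimensions, the Gaussian loadings `Lam`, the identity covariance for the hidden variables, and the uniform range for the observed noise variances are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_hidden = 4, 2  # hypothetical sizes, not taken from the paper

# Lambda: loadings of the hidden variables xi on the observed variables (assumed Gaussian).
Lam = rng.normal(size=(n_obs, n_hidden))

# E(xi xi^T): covariance of the hidden variables (assumed identity here).
cov_xi = np.eye(n_hidden)

# E(eps_O eps_O^T): diagonal covariance of the observed noise, strictly positive (assumed range).
cov_eps = np.diag(rng.uniform(0.5, 1.5, size=n_obs))

# Omega = Lambda E(xi xi^T) Lambda^T + E(eps_O eps_O^T)
Omega = Lam @ cov_xi @ Lam.T + cov_eps

is_symmetric = np.allclose(Omega, Omega.T)
min_eig = np.linalg.eigvalsh(Omega).min()
print("symmetric:", is_symmetric, "positive definite:", min_eig > 0)
```

The low-rank term is positive semidefinite and the diagonal term is positive definite, so their sum is positive definite; the off-diagonal entries of `Omega` come entirely from the shared hidden variables.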

Abstract

Unbiased data synthesis is crucial for evaluating causal discovery algorithms in the presence of unobserved confounding, given the scarcity of real-world datasets. A common approach, implicit parameterization, encodes unobserved confounding by modifying the off-diagonal entries of the idiosyncratic covariance matrix while preserving positive definiteness. Within this approach, we identify two distinct issues in state-of-the-art protocols that hinder unbiased sampling from the complete space of causal models: first, we give a detailed analysis of how the use of diagonally dominant constructions restricts the spectrum of partial correlation matrices; and second, the possible graphical structures are restricted when sampling bidirected edges, unnecessarily ruling out valid causal models. To address these limitations, we propose an improved explicit modeling approach for unobserved confounding, leveraging block-hierarchical ancestral generation of ground-truth causal graphs. Algorithms for converting the ground-truth DAG into an ancestral graph are provided so that the output of causal discovery algorithms can be compared against it. We draw connections between implicit and explicit parameterization and prove that our approach fully covers the space of causal models, including those generated by the implicit parameterization, thus enabling more robust evaluation of methods for causal discovery and inference.
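As a small illustration of why diagonally dominant constructions cannot reach every positive definite covariance, the sketch below uses a general linear-algebra fact rather than the paper's own analysis of partial correlation spectra. The helper `is_strictly_diagonally_dominant` and both example matrices are assumptions introduced here.

```python
import numpy as np

def is_strictly_diagonally_dominant(m):
    """True if every |diagonal entry| exceeds the sum of |off-diagonal entries| in its row."""
    diag = np.abs(np.diag(m))
    off_diag_sum = np.abs(m).sum(axis=1) - diag
    return bool(np.all(diag > off_diag_sum))

# Equicorrelation matrix with correlation 0.9: eigenvalues are 0.1, 0.1 and 2.8.
rho = 0.9
A = (1 - rho) * np.eye(3) + rho * np.ones((3, 3))
eigs_A = np.linalg.eigvalsh(A)

# A symmetric matrix with weak off-diagonal entries, for comparison.
B = np.array([[2.0, 0.5, 0.5],
              [0.5, 2.0, 0.5],
              [0.5, 0.5, 2.0]])

print("A positive definite:", eigs_A.min() > 0)
print("A diagonally dominant:", is_strictly_diagonally_dominant(A))
print("B diagonally dominant:", is_strictly_diagonally_dominant(B))
```

Matrix `A` is positive definite but not diagonally dominant, so a sampler that only produces diagonally dominant matrices can never generate it, even though it is a valid covariance with strong shared confounding.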

Paper Structure

This paper contains 37 sections, 45 theorems, 244 equations, 9 figures, 1 algorithm.

Key Result

Proposition 1

It holds that $\mathcal{S}_B \setminus \mathcal{A}_B \neq \emptyset$, i.e., the ancestral restricted joint space $\mathcal{A}_B$ does not cover the entire joint space $\mathcal{S}_B$.

Figures (9)

  • Figure 1: An example graph generated from our block-hierarchical data generation process with unobserved confounding, where each block is represented as a rectangle, which we call a macro node. In this example, unobserved node 4 is a child of observed node 1. Our algorithm allows control over joint dependencies across block structures.
  • Figure 2: Transformation of a DAG (left) into an ancestral graph (right) by marginalizing over the set of hidden nodes $U$. The conditional independence (CI) relationships are preserved; for example, in the left-hand-side (l.h.s.) DAG, $C \perp B \mid D$ holds, and this CI statement is maintained in the right-hand-side (r.h.s.) ancestral graph. Similarly, $A \perp D \mid B, C$ is valid in both representations.
  • Figure 3: Example of a DAG transformed to an ancestral graph after marginalizing over $U$, where $U$ is the child of an observable $F$. Due to the ancestral relationship between nodes $A$ and $C$, no bidirected edge is created.
  • Figure 4: An example DAG generated via our new data generation process and the corresponding ancestral ADMG obtained by hiding the root variable $X_0$.
  • Figure 5: An example DAG generated via our new data generation process and the corresponding ancestral ADMG when hiding the intermediate confounders $X_2$ and $X_5$.
  • ...and 4 more figures
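
The idea of hiding nodes in a DAG and reading off directed and bidirected edges among the remaining ones can be sketched with the standard latent projection. This is not the paper's conversion algorithm: the function `latent_projection`, the example graph, and the node names are hypothetical, and the output is a plain ADMG that is not guaranteed to be ancestral (the extra handling of ancestral relationships mentioned in the Figure 3 caption is not implemented here).

```python
from itertools import combinations

def latent_projection(edges, hidden):
    """Return (directed, bidirected) edge sets over the observed nodes of a DAG."""
    hidden = set(hidden)
    nodes = {v for edge in edges for v in edge}
    children = {v: set() for v in nodes}
    for parent, child in edges:
        children[parent].add(child)

    def observed_reach(start):
        # Observed nodes reachable from `start` along directed paths
        # whose intermediate nodes are all hidden.
        found, seen, stack = set(), set(), list(children[start])
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in hidden:
                stack.extend(children[v])  # keep walking through hidden nodes
            else:
                found.add(v)               # stop at the first observed node
        return found

    # a -> b: directed path from a to b through hidden nodes only.
    directed = {(a, b) for a in nodes - hidden for b in observed_reach(a)}

    # a <-> b: a hidden node reaches both a and b through hidden nodes only.
    bidirected = set()
    for u in hidden:
        for a, b in combinations(sorted(observed_reach(u)), 2):
            bidirected.add((a, b))
    return directed, bidirected

# Hypothetical DAG: U -> A, U -> B, A -> C, C -> H, H -> D, with U and H hidden.
edges = [("U", "A"), ("U", "B"), ("A", "C"), ("C", "H"), ("H", "D")]
directed, bidirected = latent_projection(edges, hidden={"U", "H"})
print(sorted(directed), sorted(bidirected))
```

In this example the hidden root `U` induces the bidirected edge between `A` and `B`, while the hidden intermediate node `H` is absorbed into the directed edge from `C` to `D`.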

Theorems & Definitions (133)

  • Remark 1: Source Vertex
  • Definition 1: Bidirected PD Cone [sullivant2023algebraic]
  • Definition 2: Implicit Parameterization of Unobserved Confounding
  • Definition 3: General Joint $(W, \Omega)$ Space
  • Definition 4: Ancestral Restricted Joint $(W, \Omega)$ Space
  • Proposition 1
  • proof
  • Remark 2
  • Definition 5: trek
  • Definition 6: trek monomial
  • ...and 123 more