JANUS: Structured Bidirectional Generation for Guaranteed Constraints and Analytical Uncertainty

Taha Racicot

TL;DR

This work introduces JANUS (Joint Ancestral Network for Uncertainty and Synthesis), a framework built on a DAG of Bayesian Decision Trees that unifies high-fidelity generation, control over logical constraints, reliable uncertainty estimation, and computational efficiency. Its key algorithm, Reverse-Topological Back-filling, propagates constraints backwards through the causal graph and achieves 100% constraint satisfaction on feasible constraint sets without rejection sampling.

Abstract

High-stakes synthetic data generation faces a fundamental Quadrilemma: simultaneously achieving Fidelity to the original distribution, Control over complex logical constraints, Reliability in uncertainty estimation, and Efficiency in computational cost. State-of-the-art Deep Generative Models (CTGAN, TabDDPM) excel at fidelity but rely on inefficient rejection sampling for continuous range constraints. Conversely, Structural Causal Models offer logical control but struggle with high-dimensional fidelity and complex noise inversion. We introduce JANUS (Joint Ancestral Network for Uncertainty and Synthesis), a framework that unifies these capabilities using a DAG of Bayesian Decision Trees. Our key innovation is Reverse-Topological Back-filling, an algorithm that propagates constraints backwards through the causal graph, achieving 100% constraint satisfaction on feasible constraint sets without rejection sampling. This is paired with an Analytical Uncertainty Decomposition derived from Dirichlet priors, enabling 128x faster uncertainty estimation than Monte Carlo methods. Across 15 datasets and 523 constrained scenarios, JANUS achieves state-of-the-art fidelity (Detection Score 0.497), eliminates mode collapse on imbalanced data, and provides exact handling of complex inter-column constraints (e.g., Salary_offered >= Salary_requested) where baselines fail entirely.

Paper Structure

This paper contains 28 sections, 2 theorems, 5 equations, 9 figures, 16 tables, 1 algorithm.

Key Result

Proposition 1

Under the feasibility assumption, the back-filling algorithm (Algorithm 1) produces samples satisfying all constraints with probability 1.
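
To make the proposition concrete, here is a minimal sketch of the two-phase idea on a toy Age, Income, Loan graph. It is an illustration under stated assumptions, not the paper's Algorithm 1: the leaf table, the interval bounds, the function names, and the uniform draw inside a leaf (standing in for the stored histogram) are all hypothetical, and Age is sampled independently for simplicity.

```python
import random

# Hypothetical leaves of a decision tree modelling Loan given Income.
# "counts" plays the role of the forward Dirichlet statistics for P(Loan | Income);
# "income" is the interval the leaf covers, used for the backward direction.
LEAVES = [
    {"income": (20_000, 50_000), "counts": {"Denied": 40, "Approved": 0}},
    {"income": (50_000, 80_000), "counts": {"Denied": 25, "Approved": 5}},
    {"income": (80_000, 150_000), "counts": {"Denied": 3, "Approved": 27}},
]


def feasible_leaves(target):
    """Phase 1: keep only the leaves in which the target label was observed."""
    return [leaf for leaf in LEAVES if leaf["counts"][target] > 0]


def backfill_sample(target, rng):
    """Phase 2: fix the child to the target, then draw its parent from a compatible leaf."""
    leaves = feasible_leaves(target)
    if not leaves:
        # Mirrors the feasibility assumption: no guarantee for unsatisfiable constraints.
        raise ValueError(f"constraint Loan={target} is infeasible")
    # Weight each leaf by how often the target occurred there, i.e. P(leaf | Loan=target).
    weights = [leaf["counts"][target] for leaf in leaves]
    leaf = rng.choices(leaves, weights=weights, k=1)[0]
    low, high = leaf["income"]
    # Assumption: uniform draw within the leaf instead of a stored histogram.
    income = rng.uniform(low, high)
    # Age is unaffected by the constraint in this toy graph, so it is sampled forward.
    return {"Age": rng.randint(21, 65), "Income": income, "Loan": target}


if __name__ == "__main__":
    rng = random.Random(0)
    print(backfill_sample("Approved", rng))
```

Every call returns a row with Loan set to the requested value and an Income drawn only from leaves where that value has support, so no draw is ever discarded.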

Figures (9)

  • Figure 1: JANUS Architecture. Left: Causal DAG where each node is a feature. Right: Each non-root node is modeled by a Bayesian Decision Tree. Leaves store dual information: Dirichlet posteriors $\boldsymbol{\alpha}$ for forward $P(Y|X)$ and histograms $H$ for backward $P(X|Y)$. Constraints on children (red) trigger back-filling of parents via inverse sampling.
  • Figure 2: Back-filling Algorithm. Phase 1: Given the constraint "Loan=Approved", we propagate backward to find which Income values can satisfy it (e.g., Income $>$ \$80k). Phase 2: Sample parents (Age, Edu) normally, then sample Income from the filtered range, guaranteeing the constraint is satisfied without rejection.
  • Figure 3: Epistemic Uncertainty Validation. Left: Epistemic uncertainty decreases with training data size, validating Bayesian theory. Middle: Epistemic vs. aleatoric decomposition shows distinct uncertainty sources. Right: Model accuracy improves with more training data. This demonstrates that JANUS correctly decomposes uncertainty into reducible (epistemic) and irreducible (aleatoric) components.
  • Figure 4: Uncertainty Terrain Map. (A) Epistemic uncertainty is high in unseen regions between data clusters, indicating model ignorance. (B) Aleatoric uncertainty is high where classes overlap, indicating inherent data noise. This 2D visualization (PCA projection, 97.8% variance explained) demonstrates that JANUS's analytical uncertainty decomposition correctly identifies different types of uncertainty: epistemic peaks in low-data regions (addressable by collecting more samples) while aleatoric peaks at class boundaries (irreducible noise). The distinct spatial patterns validate the theoretical decomposition given by the paper's uncertainty equations (total through epistemic).
  • Figure 5: (a) Pareto frontier: JANUS (J) achieves best speed-quality tradeoff (top-left is optimal). D=DCM, T=TVAE, C=CTGAN, R=CAREFL. (b) The Computational Wall: Deep learning methods (CTGAN, TVAE, TabDDPM) require rejection sampling for constraints, causing sample counts to grow exponentially as constraints tighten ($y$-axis, log scale). JANUS remains flat at $y=1$ (guaranteed satisfaction without rejection). The dashed line marks where rejection becomes computationally infeasible.
  • ...and 4 more figures
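
The cost of rejection sampling described in the Figure 5(b) caption can be illustrated with a short simulation. This is a sketch under a simplifying assumption: each forward draw satisfies the constraint independently with probability `accept_prob`, so the number of draws per kept sample is geometric with mean 1/`accept_prob`. The function names are hypothetical and no baseline model is actually run.

```python
import random


def draws_until_accept(accept_prob, rng):
    """Count forward draws until one satisfies the constraint."""
    draws = 1
    while rng.random() >= accept_prob:
        draws += 1
    return draws


def mean_draws(accept_prob, trials, rng):
    """Average number of draws needed per accepted sample."""
    total = sum(draws_until_accept(accept_prob, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # Tighter constraints correspond to smaller acceptance probabilities.
    for p in (0.5, 0.1, 0.01):
        print(f"acceptance={p}: about {mean_draws(p, 2000, rng):.1f} draws, theory {1 / p:.0f}")
```

A generator that satisfies the constraint by construction needs exactly one draw per kept sample regardless of how tight the constraint is, which is the flat line the caption attributes to JANUS.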

Theorems & Definitions (2)

  • Proposition 1: Guaranteed Satisfaction
  • Theorem 1: Parametric Epistemic Convergence [walter_technical_2009]
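
The paper's uncertainty equations are not reproduced in this listing, so the following sketch uses the standard closed-form decomposition for a Dirichlet posterior as an assumption: total uncertainty is the entropy of the posterior mean, aleatoric uncertainty is the expected entropy under the Dirichlet, and epistemic uncertainty is their difference. The helper names are hypothetical, and the digamma function is implemented by hand to keep the block dependency-free.

```python
import math


def digamma(x):
    """Digamma for x > 0 via upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def decompose(alpha):
    """Return (total, aleatoric, epistemic) in nats for Dirichlet parameters alpha > 0."""
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    # Entropy of the posterior mean class probabilities.
    total = -sum(p * math.log(p) for p in mean)
    # Expected entropy of a categorical drawn from the Dirichlet (closed form).
    aleatoric = -sum(
        p * (digamma(a + 1.0) - digamma(a0 + 1.0)) for p, a in zip(mean, alpha)
    )
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    # Same class ratio, increasing evidence; a prior of 1 is added per class.
    for scale in (1, 10, 100):
        alpha = [1 + 3 * scale, 1 + 1 * scale]
        total, aleatoric, epistemic = decompose(alpha)
        print(f"n={4 * scale}: total={total:.4f} aleatoric={aleatoric:.4f} epistemic={epistemic:.4f}")
```

Because the decomposition is evaluated directly from the leaf's Dirichlet parameters, no Monte Carlo sampling is involved, and the epistemic term shrinks as the counts grow while the class ratio stays fixed.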