$\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery

Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Steffen Sonntag, Martin Roth

TL;DR

causalAssembly tackles the challenge of validating causal discovery methods on real-world data by creating semisynthetic datasets that preserve the characteristics of real data while imposing a ground-truth causal order. It combines domain knowledge about a production line with distributional random forests to learn conditional distributions and to synthesize data that are Markov with respect to a learned layered DAG, which enables benchmarking beyond simplistic simulations. The approach is implemented in a Python library and demonstrated through initial benchmarks in which standard causal discovery methods struggle on complex, realistic data, highlighting the value of models that exploit layered structure and flexible nonparametric conditionals. This has practical impact for privacy-preserving data sharing and for rigorous evaluation of causal discovery tools in industrial settings, with clear paths for extending to mixed data types and interventional studies.

Abstract

Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and associated ground truth information to build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests in order to flexibly estimate and represent conditional distributions that may be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
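
The idea of combining estimated conditionals into a joint distribution that adheres to a causal model can be illustrated with ancestral sampling: visit the nodes of the DAG in topological order and draw each one from its conditional given the already sampled parents. The sketch below is a minimal illustration under stated assumptions and does not use the causalAssembly API. The node names and the linear-Gaussian `sample_conditional` are hypothetical placeholders for the conditionals that the paper estimates with distributional random forests.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG (not from the paper): node -> list of its parents.
parents = {
    "press_force": [],
    "press_depth": ["press_force"],
    "staking_force": ["press_depth"],
    "leak_rate": ["press_depth", "staking_force"],
}


def sample_conditional(node, parent_values, rng):
    """Placeholder conditional: noisy scaled sum of the parent values.

    Assumption for illustration only; the source learns these
    conditionals nonparametrically instead of fixing a parametric form.
    """
    return 0.8 * sum(parent_values) + rng.gauss(0.0, 1.0)


def sample_row(parents, rng):
    """Draw one joint observation by ancestral sampling."""
    row = {}
    # static_order() yields every node after all of its parents.
    for node in TopologicalSorter(parents).static_order():
        parent_values = [row[p] for p in parents[node]]
        row[node] = sample_conditional(node, parent_values, rng)
    return row


rng = random.Random(0)
data = [sample_row(parents, rng) for _ in range(1000)]
```

Because each variable is drawn using only its parents, the resulting joint distribution factorizes according to the DAG and is therefore Markov with respect to it, whatever form the conditionals take.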

Paper Structure

This paper contains 29 sections, 1 theorem, 6 equations, 24 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Let $L = (G, \mathcal{L})$ and $L' = (G', \mathcal{L})$ be two layered DAGs that share the same vertex set $V$ and the same partition $\mathcal{L} = (V_1, \dots, V_K)$. If $L_s = L'_s$ for all $s \in [K]$, then $\Pi_L = \Pi_{L'}$.

Figures (24)

  • Figure 1: Illustration of the phases of a press-in and staking process. In press-fitting (a1) the tool moves downward axially until it contacts the valve (a2) and pushes it into a slightly smaller bore until its shoulder reaches the axial block position (a3).
  • Figure 2: Illustration of the production line with five process stations, each containing two successive processes. The first station prepares components, while stations two and four carry out staking tasks. The remaining stations perform press-in tasks.
  • Figure 3: Assembly line ground truth after edges have been learned between processes using SpAM. We depict production stations, not the processes into which they are decomposed. Node size increases with the number of out-edges. Node color gets brighter with the number of in-edges. In terms of sparsity, the assembly line ground truth graph accounts for around $10.2 \%$ of all possible connections.
  • Figure 4: Kernel density plots (upper panel) of the same variables in the real (blue) and semisynthetic (yellow) data. The variables shown are those with the highest (first three) and lowest (last three) agreement between real and semisynthetic data, as measured by the Kolmogorov-Smirnov statistic.
  • Figure 5: Bivariate scatter plots of selected nodes to showcase the different types of bivariate patterns to expect in data generated with causalAssembly. All node pairs are causally linked, with x-axis variables being parents of y-axis variables. The number of parents of each y-axis variable ranges from one in (a) to three in (b) and six in (c).
  • ...and 19 more figures
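
The Figure 4 caption selects variables by their Kolmogorov-Smirnov statistic between real and semisynthetic data. As a rough sketch of how such a ranking could be computed, the block below implements the two-sample statistic (the largest gap between the two empirical distribution functions) in plain Python and sorts variables by it. The column names and values are hypothetical toy data, and the paper's own selection procedure may differ in detail.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    return max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )


# Hypothetical toy columns: variable name -> observed values.
real = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]}
synthetic = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}

scores = {name: ks_statistic(real[name], synthetic[name]) for name in real}
# A smaller statistic means closer marginal agreement.
ranking = sorted(scores, key=scores.get)
```

The statistic compares marginal distributions only, so a low value for every variable does not by itself show that the dependence structure between variables is reproduced.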

Theorems & Definitions (3)

  • Proposition 1
  • Remark 1
  • Proof