Data Augmentation via Causal-Residual Bootstrapping

Mateusz Gajewski, Sophia Xiao, Bijan Mazaheri

Abstract

Data augmentation integrates domain knowledge into a dataset by making domain-informed modifications to existing data points. For example, image data can be augmented by duplicating images in different tints or orientations, thereby incorporating the knowledge that images may vary in these dimensions. Recent work by Teshima and Sugiyama has explored the integration of causal knowledge (e.g., A causes B causes C) up to conditional independence equivalence. We suggest a related approach for settings with additive noise that can incorporate information beyond a Markov equivalence class. The approach, built on the principle of independent mechanisms, permutes the residuals of models built on marginal probability distributions. Predictive models built on our augmented data demonstrate improved accuracy, for which we provide theoretical backing in linear Gaussian settings.

Paper Structure (68 sections, 15 theorems, 79 equations, 9 figures, 19 tables, 1 algorithm)

Key Result

Proposition 2.4

The learning phase of Causal-Residual Bootstrapping, which performs linear regression of each variable $V_j$ on its parents $\operatorname{\mathbf{PA}}(V_j)$, computes the maximum likelihood estimates under the DAG-constrained linear Gaussian model.
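
To make the learning phase concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It fits one ordinary least-squares regression per variable on its parents, as the proposition describes, and then generates new rows by redrawing each variable's residuals independently of its parents. The function names `fit_linear_sem` and `augment`, the dictionary-based graph encoding, and the choice to resample residuals with replacement (rather than apply a strict permutation) are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def fit_linear_sem(data, parents):
    """Regress each variable on its parents (OLS with intercept).

    data:    dict mapping variable name -> 1-D array of observations
    parents: dict mapping variable name -> list of parent names (a DAG)
    Returns per-variable coefficient vectors and residual vectors.
    """
    n = len(next(iter(data.values())))
    coefs, resid = {}, {}
    for v, pa in parents.items():
        # Design matrix: intercept column followed by one column per parent.
        X = np.column_stack([np.ones(n)] + [data[p] for p in pa])
        beta, *_ = np.linalg.lstsq(X, data[v], rcond=None)
        coefs[v] = beta
        resid[v] = data[v] - X @ beta
    return coefs, resid


def augment(coefs, resid, parents, order, n_new, rng):
    """Generate n_new synthetic rows, visiting variables in topological order.

    Assumption: residuals are resampled with replacement, independently for
    each variable, which breaks their pairing with the original parent values.
    """
    new = {}
    for v in order:
        eps = rng.choice(resid[v], size=n_new, replace=True)
        X = np.column_stack([np.ones(n_new)] + [new[p] for p in parents[v]])
        new[v] = X @ coefs[v] + eps
    return new


# Toy chain A -> B -> C with hypothetical coefficients 2.0 and -1.0.
rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=n)
B = 2.0 * A + rng.normal(size=n)
C = -1.0 * B + rng.normal(size=n)
data = {"A": A, "B": B, "C": C}
parents = {"A": [], "B": ["A"], "C": ["B"]}

coefs, resid = fit_linear_sem(data, parents)
new = augment(coefs, resid, parents, ["A", "B", "C"], 500, rng)
```

In this sketch the known graph enters only through `parents` and the topological `order`. A root variable such as `A` has an intercept-only regression, so its "residuals" are simply its centered observations.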

Figures (9)

  • Figure 1: Empirical validation of MSE improvement rate. Left: Simple chain $A \to B \to C$. Right: Confounded structure $A \to B \leftarrow D$, $B \to C$. Both configurations show MSE improvement following the predicted $1/N$ decay. In both cases, $B$ is the predicted variable.
  • Figure 2: Performance of the PC algorithm on datasets augmented by alternative methods. The plots show the average Structural Hamming Distance (SHD) between the true DAG and the estimated CPDAG returned by the PC algorithm as the number of augmented points increases; the shaded region indicates one standard deviation. Higher SHD indicates worse performance.
  • Figure 3: Performance of the DirectLiNGAM algorithm.
  • Figure 4: Mean MSE across all variables by augmentation method (Known Graph, $n=100$). Lower is better. CRB achieves the best performance, while non-causal augmenters often increase error relative to no augmentation.
  • Figure 5: Per-variable MSE comparison (Known Graph, $n=100$).
  • ...and 4 more figures

Theorems & Definitions (40)

  • Proposition 2.4: CRB Learning Phase Equals Constrained MLE
  • proof
  • Remark 2.5
  • Proposition 2.6: Causal Meaning of UDU Entries
  • Theorem 2.7: DAG Constraints in UDU Form
  • proof
  • Remark 2.8: Parameterization Perspective
  • Theorem 2.9: Variance Reduction from Known Parameters
  • proof
  • Remark 2.10: Intuition
  • ...and 30 more