Data Augmentation via Causal-Residual Bootstrapping

Mateusz Gajewski; Sophia Xiao; Bijan Mazaheri

Abstract

Data augmentation integrates domain knowledge into a dataset by making domain-informed modifications to existing data points. For example, image data can be augmented by duplicating images in different tints or orientations, thereby incorporating the knowledge that images may vary along these dimensions. Recent work by Teshima and Sugiyama has explored the integration of causal knowledge (e.g., A causes B, which causes C) up to conditional independence equivalence. We suggest a related approach for settings with additive noise that can incorporate information beyond a Markov equivalence class. The approach, built on the principle of independent mechanisms, permutes the residuals of models built on marginal probability distributions. Predictive models built on our augmented data demonstrate improved accuracy, for which we provide theoretical backing in linear Gaussian settings.
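The residual-permutation idea in the abstract can be illustrated with a minimal sketch for a linear additive-noise model with a known DAG. This is one plausible reading under stated assumptions, not the paper's algorithm: the function name `crb_augment`, its arguments, the intercept term, and the choice to draw one independent permutation per variable are all hypothetical choices made for illustration.

```python
import numpy as np

def crb_augment(data, parents, order, rng):
    """Illustrative sketch only (not the authors' implementation).

    data:    dict mapping variable name -> 1-D array of n observations
    parents: dict mapping variable name -> list of parent names in the DAG
    order:   variable names in a topological order of the DAG
    rng:     numpy random Generator
    Returns a dict with n augmented observations per variable.
    """
    n = len(data[order[0]])
    coefs, resid = {}, {}

    # Learning phase: least-squares fit of each variable on its parents
    # (an intercept column is added here as an assumption).
    for v in order:
        X = np.column_stack([np.ones(n)] + [data[p] for p in parents[v]])
        beta, *_ = np.linalg.lstsq(X, data[v], rcond=None)
        coefs[v] = beta
        resid[v] = data[v] - X @ beta

    # Augmentation phase: permute each variable's residuals independently,
    # then rebuild the variables from roots to leaves.
    new = {}
    for v in order:
        eps = rng.permutation(resid[v])
        X = np.column_stack([np.ones(n)] + [new[p] for p in parents[v]])
        new[v] = X @ coefs[v] + eps
    return new

# Usage on a simulated chain A -> B -> C (coefficients chosen arbitrarily).
rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=n)
B = 2.0 * A + rng.normal(size=n)
C = -1.0 * B + rng.normal(size=n)
data = {"A": A, "B": B, "C": C}
parents = {"A": [], "B": ["A"], "C": ["B"]}
augmented = crb_augment(data, parents, ["A", "B", "C"], rng)
```

Because each variable's residuals are shuffled separately, the augmented rows recombine noise terms that never co-occurred in the original sample, which is where the independent-mechanisms assumption enters.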

Paper Structure (68 sections, 15 theorems, 79 equations, 9 figures, 19 tables, 1 algorithm)

Introduction
Summary of Contributions
Related Works
Data Augmentation.
Incorporating Causal Constraints.
Causal Structure and Predictive Models.
Other Related Works.
Preliminaries
Notation Conventions
Structural Causal Models
Structural Equation Models
Causally Constrained Regression
Causal-Residual Bootstrapping (CRB)
Problem Setup and Input Specification
Augmentation Procedure
...and 53 more sections

Key Result

Proposition 2.4

The learning phase of Causal-Residual Bootstrapping, which performs linear regression of each variable $V_j$ on its parents $\operatorname{\mathbf{PA}}(V_j)$, computes the maximum likelihood estimates under the DAG-constrained linear Gaussian model.
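The standard argument behind a statement of this kind can be sketched as follows. This is not the paper's proof. The symbols $\beta_j$, $\sigma_j^2$, $N$, and the superscript $(i)$ for the $i$-th sample are notation assumed here, and the variables are taken to be centered (or a constant is included among the regressors). Under a linear Gaussian model with independent noise terms, the log-likelihood factorizes over the nodes of the DAG:

$$
\ell(\beta, \sigma^2) \;=\; \sum_{j} \sum_{i=1}^{N} \log \mathcal{N}\!\left( v_j^{(i)} \,;\; \beta_j^{\top} \operatorname{\mathbf{PA}}(V_j)^{(i)},\; \sigma_j^2 \right)
$$

Each inner sum depends only on $(\beta_j, \sigma_j^2)$, so the joint maximization splits into one problem per node, and each of these is an ordinary least-squares problem:

$$
\hat{\beta}_j \;=\; \arg\min_{\beta_j} \sum_{i=1}^{N} \left( v_j^{(i)} - \beta_j^{\top} \operatorname{\mathbf{PA}}(V_j)^{(i)} \right)^2,
\qquad
\hat{\sigma}_j^2 \;=\; \frac{1}{N} \sum_{i=1}^{N} \left( v_j^{(i)} - \hat{\beta}_j^{\top} \operatorname{\mathbf{PA}}(V_j)^{(i)} \right)^2
$$

The DAG constraint enters only through which variables appear in $\operatorname{\mathbf{PA}}(V_j)$: coefficients on non-parents are fixed at zero rather than estimated.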

Figures (9)

Figure 1: Empirical validation of MSE improvement rate. Left: Simple chain $A \to B \to C$. Right: Confounded structure $A \to B \leftarrow D$, $B \to C$. Both configurations show MSE improvement following the predicted $1/N$ decay. In both cases, $B$ is the variable being predicted.
Figure 2: Performance of the PC algorithm on datasets augmented by alternative methods. The plots show the average Structural Hamming Distance (SHD) between the true DAG and the estimated CPDAG returned by the PC algorithm as the number of augmented points increases; the shaded region indicates one standard deviation. Higher SHD indicates worse performance.
Figure 3: Performance of the DirectLINGAM algorithm.
Figure 4: Mean MSE across all variables by augmentation method (Known Graph, $n=100$). Lower is better. CRB achieves the best performance, while non-causal augmenters often increase error relative to no augmentation.
Figure 5: Per-variable MSE comparison (Known Graph, $n=100$).
...and 4 more figures

Theorems & Definitions (40)

Proposition 2.4: CRB Learning Phase Equals Constrained MLE
proof
Remark 2.5
Proposition 2.6: Causal Meaning of UDU Entries
Theorem 2.7: DAG Constraints in UDU Form
proof
Remark 2.8: Parameterization Perspective
Theorem 2.9: Variance Reduction from Known Parameters
proof
Remark 2.10: Intuition
...and 30 more
