Empirically Calibrated Conditional Independence Tests

Milleno Pan, Antoine de Mathelin, Wesley Tansey

TL;DR

Empirically Calibrated Conditional Independence Tests (ECCIT), a method that measures and corrects for miscalibration, achieves valid FDR with higher power than existing calibration strategies while remaining test agnostic.

Abstract

Conditional independence tests (CITs) are widely used for causal discovery and feature selection. Even when paired with false discovery rate (FDR) control procedures, they often fail to provide frequentist guarantees in practice. We highlight two common failure modes: (i) in small samples, asymptotic guarantees for many CITs can be inaccurate, and even correctly specified models fail to estimate the noise levels and control the error; and (ii) when sample sizes are large but models are misspecified, unaccounted dependencies skew the test's behavior, so the test fails to return uniform p-values under the null. We propose Empirically Calibrated Conditional Independence Tests (ECCIT), a method that measures and corrects for miscalibration. For a chosen base CIT (e.g., GCM, HRT), ECCIT optimizes an adversary that selects features and response functions to maximize a miscalibration metric. ECCIT then fits a monotone calibration map that adjusts the base-test p-values in proportion to the observed miscalibration. Across empirical benchmarks on synthetic and real data, ECCIT achieves valid FDR with higher power than existing calibration strategies while remaining test agnostic.
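
To make the idea of a monotone calibration map concrete, the following is a minimal sketch and not the ECCIT procedure itself. It swaps in a much simpler technique: an empirical-CDF map fitted on simulated null p-values, with no adversary. The function name `fit_monotone_calibration` and the toy anti-conservative null (`u ** 2`) are assumptions made purely for illustration.

```python
import numpy as np


def fit_monotone_calibration(null_pvalues):
    """Fit a monotone map from raw p-values to calibrated p-values.

    Hypothetical helper: uses the empirical CDF of p-values obtained under
    a known null. This is a stand-in for illustration, not ECCIT's map.
    """
    sorted_null = np.sort(np.asarray(null_pvalues, dtype=float))
    n = sorted_null.size

    def calibrate(p):
        # Count null p-values <= p; the +1 smoothing keeps outputs above 0.
        rank = np.searchsorted(sorted_null, np.asarray(p, dtype=float), side="right")
        return (rank + 1.0) / (n + 1.0)

    return calibrate


rng = np.random.default_rng(0)

# Toy miscalibrated base test: null p-values are u**2, which is
# stochastically smaller than uniform (anti-conservative).
null_p = rng.uniform(size=5000) ** 2
calibrate = fit_monotone_calibration(null_p)

# Fresh null draws from the same toy test, before and after calibration.
fresh_null = rng.uniform(size=5000) ** 2
raw_rate = np.mean(fresh_null <= 0.1)             # inflated Type-I error
cal_rate = np.mean(calibrate(fresh_null) <= 0.1)  # close to the nominal 0.1

print(f"raw rejection rate at 0.1:        {raw_rate:.3f}")
print(f"calibrated rejection rate at 0.1: {cal_rate:.3f}")
```

In this toy setting the raw test rejects far more often than the nominal level under the null, while the calibrated p-values are approximately uniform. The sketch assumes null p-values can be simulated directly; the abstract instead describes an adversary that searches for the worst-case miscalibration.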

Paper Structure

This paper contains 37 sections, 1 theorem, 55 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Assume the true conditional law $Y\mid X$ lies in the adversary class used to compute the FDP metric function $\varphi_{\mathrm{FDP}}(\cdot)$. Then running BH at level $\alpha_{\mathrm{cal}}$ controls the FDR at the nominal level, $\mathrm{FDR}\le\alpha$. Moreover, $\alpha_{\mathrm{cal}}\le \alpha$, so the adjustment is conservative whenever $\varphi_{\mathrm{FDP}}(\alpha)>\alpha$.
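
The theorem concerns running the Benjamini-Hochberg (BH) procedure at a calibrated level $\alpha_{\mathrm{cal}}\le\alpha$. The sketch below illustrates that mechanic under loudly stated assumptions: the rule used to pick the calibrated level (the largest grid value $a\le\alpha$ with $\varphi_{\mathrm{FDP}}(a)\le\alpha$) and the toy metric `phi` are hypothetical, since this summary does not give the paper's definition of $\alpha_{\mathrm{cal}}$. Only the BH step-up procedure itself is standard.

```python
import numpy as np


def benjamini_hochberg(pvalues, level):
    """Standard BH step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = level * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest index passing its threshold.
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True
    return reject


def calibrated_level(phi_fdp, alpha, grid_size=1000):
    """Hypothetical rule (an assumption, not the paper's definition):
    largest grid value a <= alpha whose measured FDP phi_fdp(a) is <= alpha."""
    grid = np.linspace(0.0, alpha, grid_size + 1)[1:]
    ok = np.array([phi_fdp(a) <= alpha for a in grid])
    return float(grid[ok].max()) if ok.any() else 0.0


def phi(a):
    # Toy miscalibration metric (assumption): realized FDP is twice nominal.
    return min(1.0, 2.0 * a)


alpha = 0.2
alpha_cal = calibrated_level(phi, alpha)  # close to 0.1 for this toy phi

# Toy p-values: 900 nulls (uniform) and 100 strong signals.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=900), rng.uniform(size=100) * 1e-3])

reject_nominal = benjamini_hochberg(pvals, alpha)
reject_cal = benjamini_hochberg(pvals, alpha_cal)
print(f"alpha_cal = {alpha_cal:.4f}")
print(f"rejections at alpha: {reject_nominal.sum()}, at alpha_cal: {reject_cal.sum()}")
```

Because BH rejection sets are nested in the level, the calibrated run can only reject a subset of what the nominal run rejects, which matches the conservative direction described in the theorem.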

Figures (10)

  • Figure 1: Single Experiment Performance. Realized Type-I error and power versus nominal $\alpha$ for raw and calibrated HRT on a correlated dataset.
  • Figure 2: Miscalibration over sample size (log scaled) by features on a well-specified model for both miscalibration metrics. The red dotted line indicates the selected nominal threshold of $\alpha=0.2$.
  • Figure 3: Valid Power Gain by Features. Calibrated with a nonlinear adversary. Performance evaluated with $10m$ samples on a nonlinear response $Y$.
  • Figure 4: Valid Power Gain by Distribution. Calibrated with a nonlinear adversary. Performance evaluated on a nonlinear response $Y$.
  • Figure 5: Valid Power and FDR Comparison on Gene Expression Data. Calibrated with FDP metric. Nonlinear response $Y$.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1