Effect-Level Validation for Causal Discovery

Hoang Dang, Luan Pham, Minh Nguyen

TL;DR

This paper reframes causal discovery for telemetry as an effect-level decision problem, introducing an admissibility-first pipeline that prioritizes identifiability and positivity over graph-family fit. By enforcing domain constraints and applying effect-level validation (stability, refutation, and sensitivity analyses), the authors show that many statistically plausible graphs fail to identify the target estimand, while a subset of algorithms yields stable, domain-consistent effects despite structural differences. Empirical results on the effect of early PvP exposure on Day-1 retention show an uplift of approximately 15 percentage points among paying players when the effect is identifiable, and robustness checks (placebo, subsampling, E-values) indicate moderate robustness to unmeasured confounding within the admissibility framework. The work argues for adopting admissibility gates, stability checks, and falsification tests as standard evaluation criteria, and provides practical guidance for deploying causal discovery in feedback-driven telemetry systems.

Abstract

Causal discovery is increasingly applied to large-scale telemetry data to estimate the effects of user-facing interventions, yet its reliability for decision-making in feedback-driven systems with strong self-selection remains unclear. In this paper, we propose an effect-centric, admissibility-first framework that treats discovered graphs as structural hypotheses and evaluates them by identifiability, stability, and falsification rather than by graph recovery accuracy alone. Empirically, we study the effect of early exposure to competitive gameplay on short-term retention using real-world game telemetry. We find that many statistically plausible discovery outputs do not admit point-identified causal queries once minimal temporal and semantic constraints are enforced, highlighting identifiability as a critical bottleneck for decision support. When identification is possible, several algorithm families converge to similar, decision-consistent effect estimates despite producing substantially different graph structures, including cases where the direct treatment-outcome edge is absent and the effect is preserved through indirect causal pathways. These converging estimates survive placebo, subsampling, and sensitivity refutation. In contrast, other methods exhibit sporadic admissibility and threshold-sensitive or attenuated effects due to endpoint ambiguity. These results suggest that graph-level metrics alone are inadequate proxies for causal reliability for a given target query. Therefore, trustworthy causal conclusions in telemetry-driven systems require prioritizing admissibility and effect-level validation over causal structural recovery alone.
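The sensitivity refutation mentioned above can be illustrated with an E-value, which the TL;DR lists among the robustness checks. The following is a minimal sketch using the standard E-value formula for a point estimate on the risk-ratio scale; it is not the authors' code, and the retention rates in the usage lines are hypothetical placeholders rather than figures from the paper.

```python
import math


def risk_ratio(treated_rate: float, control_rate: float) -> float:
    """Risk ratio of the outcome (e.g. Day-1 retention) between groups."""
    if control_rate <= 0 or treated_rate <= 0:
        raise ValueError("rates must be positive")
    return treated_rate / control_rate


def e_value(rr: float) -> float:
    """E-value for a risk-ratio point estimate.

    For RR >= 1 this is RR + sqrt(RR * (RR - 1)); estimates below 1
    are inverted first so the same formula applies.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = rr if rr >= 1.0 else 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


# Hypothetical rates chosen only for illustration (NOT from the paper):
# 45% retention among treated players versus 30% among controls.
rr = risk_ratio(0.45, 0.30)
print(f"RR = {rr:.2f}, E-value = {e_value(rr):.2f}")
```

A larger E-value means an unmeasured confounder would need stronger associations with both treatment and outcome to explain the estimate away; an E-value of 1 indicates no robustness at all.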

Paper Structure (31 sections, 6 equations, 2 figures, 10 tables)

This paper contains 31 sections, 6 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Domain-admissible causal graph constructed with pre-specified constraints. This graph satisfies temporal and semantic invariants and admits valid backdoor identification for the target estimand.
  • Figure 2: Propensity score distributions for treated (PvP=1) and control (PvP=0) groups. Substantial overlap exists in the common support region [0.032, 0.935].
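A common-support interval such as the one reported for Figure 2 can be derived from estimated propensity scores. The sketch below assumes the usual min/max overlap rule (the largest group minimum to the smallest group maximum); the paper's summary does not say which rule the authors used, and the function names and toy scores are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def common_support(
    treated: Sequence[float], control: Sequence[float]
) -> Optional[Tuple[float, float]]:
    """Overlap interval of two propensity-score samples (min/max rule).

    Returns None when the two score ranges do not overlap at all.
    """
    if not treated or not control:
        raise ValueError("both groups need at least one score")
    lower = max(min(treated), min(control))
    upper = min(max(treated), max(control))
    return (lower, upper) if lower <= upper else None


def share_in_support(scores: Sequence[float], bounds: Tuple[float, float]) -> float:
    """Fraction of units whose score falls inside the overlap interval."""
    lower, upper = bounds
    return sum(lower <= s <= upper for s in scores) / len(scores)


# Toy scores for illustration only (not the paper's data).
treated_scores = [0.20, 0.50, 0.90]
control_scores = [0.05, 0.40, 0.70]
bounds = common_support(treated_scores, control_scores)
if bounds is not None:
    kept = share_in_support(treated_scores + control_scores, bounds)
    print(bounds, round(kept, 3))
```

Units outside the interval are typically trimmed before effect estimation, since positivity cannot be checked empirically for them.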