ClusterSC: Advancing Synthetic Control with Donor Selection

Saeyoung Rho, Andrew Tang, Noah Bergam, Rachel Cummings, Vishal Misra

TL;DR

This work extends synthetic control (SC) to disaggregate-level data by addressing the curse of dimensionality that arises from large donor pools. It introduces ClusterSC, a two-stage method that clusters donors using right-singular-vector embeddings and applies SC only to the most relevant cluster, with hard singular value thresholding (HSVT) denoising integrated into the Learn step. The authors prove theoretical guarantees for accurate subgroup identification and show improved pre- and post-intervention prediction bounds under Gaussian, sub-Gaussian, and heavy-tailed noise, complemented by empirical results on synthetic and real housing data. The approach yields higher prediction accuracy and stability, which is relevant for individual-level causal inference and policy evaluation.

Abstract

In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool. SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data. Because individual-level datasets contain a greater number of observed units, this shift introduces the curse of dimensionality to SC. To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups. ClusterSC incorporates a clustering step to select only the relevant donors for the target. We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets. The results indicate that ClusterSC consistently outperforms classical SC approaches.
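The cluster-then-fit idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`hsvt`, `kmeans`, `cluster_sc`, `make_two_group_panel`), the use of plain Lloyd's k-means, the way the target is projected into the embedding space, the OLS fit, and the choice to denoise only the pre-intervention block are all assumptions made for illustration.

```python
import numpy as np

def hsvt(X, rank):
    """Hard singular value thresholding: keep only the top `rank` singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vt

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (assumed here; any clustering routine could be swapped in)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def cluster_sc(donors_pre, donors_post, target_pre, k=2, rank=2, seed=0):
    """Hypothetical sketch: cluster donors, then fit SC on the target's cluster only.

    donors_pre:  (n_donors, T_pre)  pre-intervention donor outcomes
    donors_post: (n_donors, T_post) post-intervention donor outcomes
    target_pre:  (T_pre,)           pre-intervention target outcomes
    """
    # Embed each donor as its row of the truncated left factor of the SVD,
    # i.e. its coordinates in the basis of the top right singular vectors.
    U, s, Vt = np.linalg.svd(donors_pre, full_matrices=False)
    embedding = U[:, :rank]
    labels, centers = kmeans(embedding, k, seed=seed)

    # Assumption: embed the target with the same map (x V_r / s_r) and
    # pick the cluster whose center is nearest.
    target_emb = (target_pre @ Vt[:rank].T) / s[:rank]
    chosen = int(((centers - target_emb) ** 2).sum(axis=1).argmin())
    idx = np.flatnonzero(labels == chosen)

    # Learn step: denoise the selected donors with HSVT, then fit OLS weights.
    denoised = hsvt(donors_pre[idx], rank=min(rank, len(idx)))
    w, *_ = np.linalg.lstsq(denoised.T, target_pre, rcond=1e-8)

    # Project step: apply the learned weights to post-intervention donor data.
    return donors_post[idx].T @ w, idx

def make_two_group_panel(n_per_group=20, t_pre=30, t_post=10, noise_sd=0.0, seed=0):
    """Toy data (illustrative only): two donor groups, each following its own trend."""
    rng = np.random.default_rng(seed)
    t = np.arange(t_pre + t_post)
    patterns = np.stack([2.0 + np.sin(t / 3.0), 0.1 * t])
    scales = 1.0 + 0.1 * rng.random((2, n_per_group))
    donors = np.concatenate([np.outer(scales[g], patterns[g]) for g in range(2)])
    donors = donors + noise_sd * rng.standard_normal(donors.shape)
    target = 1.05 * patterns[0]  # the target follows the first group's trend
    return donors[:, :t_pre], donors[:, t_pre:], target[:t_pre], target[t_pre:]

pre, post, y_pre, y_post = make_two_group_panel(noise_sd=0.05)
prediction, selected = cluster_sc(pre, post, y_pre, k=2, rank=2)
print("donors selected:", len(selected), "post-intervention MSE:", np.mean((prediction - y_post) ** 2))
```

In this toy setup the target shares its latent trend with the first twenty donors, so the clustering step should discard the second group before any weights are fit. The number of clusters `k` and the truncation `rank` are tuning choices that the sketch fixes by hand.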

Paper Structure

This paper contains 43 sections, 49 theorems, 113 equations, 11 figures, 4 algorithms.

Key Result

Lemma 5.0

For any $L$-bilipschitz function $g$, $(1/L^2)\Delta^2_k(\Theta) \leq \Delta^2_k(g(\Theta)) \leq L^2 \Delta^2_k(\Theta)$.
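One way to see why such a bound holds is sketched below. The sketch assumes that $\Delta^2_k(\Theta)$ denotes the optimal $k$-means cost of the point set $\Theta$ and that $L$-bilipschitz means $(1/L)\|x-y\| \leq \|g(x)-g(y)\| \leq L\|x-y\|$; neither definition is stated in this excerpt, and the paper's own proof may proceed differently. The symbols $P$, $C_j$, $\mathrm{cost}$, $P^\star$, and $P'$ are introduced here for illustration.

$$
\mathrm{cost}(P;\Theta) \;=\; \sum_{j=1}^{k} \frac{1}{2|C_j|} \sum_{x,y \in C_j} \|x-y\|^2,
\qquad
\Delta^2_k(\Theta) \;=\; \min_{P} \mathrm{cost}(P;\Theta),
$$

where $P = \{C_1,\dots,C_k\}$ ranges over partitions of $\Theta$ and the pairwise form equals the usual sum of squared distances to cluster means. Applying the bilipschitz inequalities to every pair gives, for each fixed partition,

$$
\frac{1}{L^2}\,\mathrm{cost}(P;\Theta) \;\leq\; \mathrm{cost}(P;g(\Theta)) \;\leq\; L^2\,\mathrm{cost}(P;\Theta).
$$

Taking $P^\star$ optimal for $\Theta$ yields $\Delta^2_k(g(\Theta)) \leq \mathrm{cost}(P^\star;g(\Theta)) \leq L^2 \Delta^2_k(\Theta)$, and taking $P'$ optimal for $g(\Theta)$ yields $\Delta^2_k(g(\Theta)) = \mathrm{cost}(P';g(\Theta)) \geq (1/L^2)\,\mathrm{cost}(P';\Theta) \geq (1/L^2)\,\Delta^2_k(\Theta)$.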

Figures (11)

  • Figure 1: Visualization of the distribution of rows in $\tilde{U}$ with two different subgroups in the donor units. Each row $\tilde{U}_i$ can be interpreted as an embedding of the unit $i$, representing the composition of right singular vectors for that row.
  • Figure 2: Median Post-intervention MSE using the classical SC without our clustering step (blue) and ClusterSC (orange) for varying levels of noise.
  • Figure 3: Median of the pairwise improvement $I_i$, measured for each dataset, for different noise levels ($s$). Shaded bands represent the 95% confidence interval.
  • Figure 4: Comparison of ClusterSC and two SC benchmarks on different regression methods (OLS, Ridge, and Lasso). Each boxplot contains 100 points representing the median MSE of each iteration.
  • Figure 5: Median post-intervention MSE, measured per dataset. From left to right, the boxplots at each noise level correspond to ridge, OLS, clustering followed by ridge, and clustering followed by OLS. The left plot uses $n=1000$ donor units in total and the right plot uses $n=2000$.
  • ...and 6 more figures

Theorems & Definitions (75)

  • Definition 3.1: $\varepsilon$-separation
  • Lemma 5.0
  • Lemma 5.0
  • proof
  • Lemma 5.1: ostrovsky2013effectiveness
  • Lemma 5.1
  • Lemma 5.1
  • Lemma 5.1
  • Lemma 5.1
  • Lemma 5.1
  • ...and 65 more