Detecting Where Effects Occur by Testing Hypotheses in Order

Jake Bowers, David Kim, Nuole Chen

Abstract

Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After a report of an overall result -- statistically significant or not -- the natural question from a policy maker is: \emph{where} did any effects occur? Standard adjustments for multiple testing provide little power to answer this question. In simulations modeled after a 44-block education trial, the Hommel adjustment -- among the most powerful procedures controlling the family-wise error rate (FWER) -- detects effects in only 11\% of truly non-null blocks. We develop a procedure that tests hypotheses top-down through a tree: test the overall null at the root, then groups of blocks, then individual blocks, stopping any branch where the null is not rejected. In the same 44-block design, this approach detects effects in 44\% of non-null blocks -- roughly four times the detection rate. A stopping rule and valid tests at each node suffice for weak FWER control. We show that the strong-sense FWER depends on how rejection probabilities accumulate along paths through the tree. This yields a diagnostic: when power decays fast enough relative to branching, no adjustment is needed; otherwise, an adaptive $\alpha$-adjustment restores control. We apply the method to 25 MDRC education trials and provide an R package, \texttt{manytestsr}.

Paper Structure (30 sections, 9 theorems, 30 equations, 5 figures, 7 tables, 1 algorithm)

Key Result

Theorem 1

Conditions 1 and 2 suffice for weak FWER control. A family of true null hypotheses organized on an irregular or regular $k$-ary tree and tested following the stopping rule (Condition 1) with valid tests at each node (Condition 2) will produce a family-wise er...
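
The intuition behind weak control can be checked numerically. When every null is true, the stopping rule means that any false rejection requires rejecting the root first, so the chance of at least one false rejection cannot exceed the level of the root test. The simulation below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a complete $k$-ary tree with $k=3$ and 9 leaves, independent uniform $p$-values for the blocks, and Fisher's combination as a stand-in for the node-level test. The names `fisher_p` and `count_rejections` are hypothetical and are not part of \texttt{manytestsr}.

```python
import math
import random


def fisher_p(pvalues):
    """Fisher's combined p-value (assumed stand-in for a valid node test).

    -2 * sum(log p) is chi-square with 2m degrees of freedom under the null;
    for even degrees of freedom the survival function has a closed form.
    Requires every p-value to be strictly positive.
    """
    m = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def count_rejections(leaf_p, k, alpha):
    """Count rejected nodes when testing top-down over a complete k-ary tree.

    `leaf_p` holds one p-value per block; its length must be a power of k.
    """
    if fisher_p(leaf_p) > alpha:
        return 0  # stopping rule: nothing below this node is tested
    if len(leaf_p) == 1:
        return 1  # a rejected leaf (single block)
    size = len(leaf_p) // k
    return 1 + sum(
        count_rejections(leaf_p[i:i + size], k, alpha)
        for i in range(0, len(leaf_p), size)
    )


# All nulls true: block p-values are independent uniform draws on (0, 1].
rng = random.Random(2024)
n_sims, alpha, k, n_leaves = 5000, 0.05, 3, 9
false_rejections = 0
for _ in range(n_sims):
    leaf_p = [1.0 - rng.random() for _ in range(n_leaves)]
    if count_rejections(leaf_p, k, alpha) > 0:
        false_rejections += 1

fwer_hat = false_rejections / n_sims
print(f"Estimated weak-sense FWER: {fwer_hat:.3f} (nominal alpha = {alpha})")
```

Under these assumptions the estimated rate should sit near the nominal $\alpha$, up to Monte Carlo error. The sketch says nothing about strong-sense control, where some nulls are false and rejections can accumulate along paths through the tree.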

Figures (5)

  • Figure 1: An administratively organized structure of blocks. A study randomly assigns people within offices ($B$) to a new intervention. Each office is an experimental block containing $m_b$ people assigned to the intervention and $n_b - m_b$ people assigned to the status quo.
  • Figure 2: A $k$-ary tree with $k=3$ children per parent node, $L=3$ levels, and $k^{L-1}=9$ terminal nodes or "leaves" representing individual experimental blocks.
  • Figure 3: Simplified flow of the Top-Down Testing and Splitting Algorithm with fixed false positive level $\alpha$. All blocks are in set $\mathcal{B}_1$; $\mathcal{B}_2$ is a subset of $\mathcal{B}_1$, and $\mathcal{B}_{4}$ is a subset of the blocks in $\mathcal{B}_2$. The $p$-value $p_1$ is the result of a test of the hypothesis of no effects using all the blocks (i.e., using the set $\mathcal{B}_1$); $p_2$ is the $p$-value from a test of the null of no effects using only the blocks in $\mathcal{B}_2$. Testing stops on a branch when $p > \alpha$ or when the node contains a single block, i.e., $|\mathcal{B}_{i}|=1$ for node $i$, where $|\mathcal{B}|$ denotes the number of blocks in $\mathcal{B}$.
  • Figure 4: A $k$-ary tree with $k=3$ and $L=3$. Boxes show non-null nodes: since the leaf (node 5) is non-null, all of its ancestors are non-null. The other nodes in the tree are null.
  • Figure 5: Results of top-down testing in a simulation of 44 experimental blocks following the pre-specified experimental design of the Detroit Promise Program \citep{ratledge2019path}. Nine blocks within HFCC have non-zero effects (Cohen's $d = 0.80$); all other blocks are pure null. The algorithm identifies HFCC and descends into its cohorts and blocks while pruning null colleges. Abbreviations: Henry Ford Community College (HFCC), Macomb Community College (MCC), Oakland Community College (OCC), Schoolcraft College (SC), Wayne County Community College District (WCC). Blue nodes have non-zero causal effects.
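
The flow described in the Figure 3 caption (test a set of blocks, stop if $p > \alpha$, otherwise split and test the subsets until a node holds a single block) can be sketched as a short recursion. This is a hedged illustration, not the \texttt{manytestsr} API: the `Node` class, `fisher_p`, `top_down`, and the toy $p$-values are all hypothetical, and Fisher's combination of per-block $p$-values is an assumed stand-in for whatever valid test is used at each node.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the testing tree; a leaf covers exactly one block."""
    name: str
    blocks: list                                  # block ids covered by this node
    children: list = field(default_factory=list)


def fisher_p(pvalues):
    """Fisher's combined p-value (assumed stand-in for the node-level test)."""
    m = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)     # test statistic divided by 2
    term, total = 1.0, 1.0
    for i in range(1, m):                         # closed-form chi-square tail, 2m df
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def top_down(node, block_p, alpha=0.05, rejected=None):
    """Test `node`; descend into its children only if its null is rejected."""
    if rejected is None:
        rejected = []
    p = fisher_p([block_p[b] for b in node.blocks])
    if p > alpha:
        return rejected                           # stop: prune this whole branch
    rejected.append(node.name)
    for child in node.children:                   # a leaf has no children
        top_down(child, block_p, alpha, rejected)
    return rejected


# Toy example: four blocks in two groups; only block b1 shows a small p-value.
block_p = {"b1": 0.001, "b2": 0.6, "b3": 0.5, "b4": 0.7}
leaves = {b: Node(b, [b]) for b in block_p}
group_a = Node("A", ["b1", "b2"], [leaves["b1"], leaves["b2"]])
group_b = Node("B", ["b3", "b4"], [leaves["b3"], leaves["b4"]])
root = Node("root", ["b1", "b2", "b3", "b4"], [group_a, group_b])

detected = top_down(root, block_p)
print(detected)
```

In this toy run the branch for group B is pruned after a single test, so blocks b3 and b4 are never tested individually. Pruning of this kind is what keeps the number of tests small relative to testing every block separately.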

Theorems & Definitions (26)

  • Theorem 1
  • Remark 1
  • Theorem 2: Conditions 1 and 2 suffice for weak FWER control; restated from main text
  • Proof: Proof of Theorem \ref{thm:weakctrl}
  • Remark 2
  • Remark 3: Relationship to prior work
  • Proposition 1: FWER Expression for Sequential Tree Testing
  • proof
  • Remark 4: When is the bound tight?
  • Theorem 3: FWER Decomposition by Level
  • ...and 16 more