Informed Burn-In Decisions in RAR: Harmonizing Adaptivity and Inferential Precision Based on Study Setting

Lukas Pin, Stef Baas, Gianmarco Caruso, David S. Robertson, Sofía S. Villar

TL;DR

This paper tackles the lack of systematic guidance for burn-in length in response-adaptive randomization (RAR) by introducing the first principled framework that combines problem difficulty, design reactiveness, and expected final allocation error into a single burn-in formula. The authors define two novel metrics, the reactiveness parameter $r$ and the expected final allocation error $\epsilon_\rho$, and use them together with the standardized treatment effect $\delta$ to quantify exploration-exploitation trade-offs and stability. The core contribution is a practical rule, $b = \max\left\{ 2, \left\lfloor 0.5 \cdot \frac{n \cdot n_{1/2}}{n+n_{1/2}} \cdot (r+\epsilon_\rho)^{\delta} \right\rfloor \right\}$, that determines burn-in length and adapts to trial size and design risk. Simulation results on ARREST and CALISTO demonstrate that this framework stabilizes inference (reducing Type-I error inflation and MSE) while preserving or improving patient benefit and final power, outperforming fixed burn-in heuristics and offering practitioners a data-driven tool.
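To make the rule concrete, here is a minimal Python sketch that evaluates it exactly as quoted above. The function name `burn_in_length`, the argument names, and all input values are hypothetical illustrations, not taken from the paper; in particular, how $r$, $\epsilon_\rho$, and $\delta$ are estimated for a given design is not covered here. The input checks are an added assumption so that the power term stays real-valued.

```python
import math


def burn_in_length(n, n_half, r, eps_rho, delta):
    """Evaluate b = max{2, floor(0.5 * n*n_half/(n+n_half) * (r+eps_rho)^delta)}.

    n       : total sample size
    n_half  : the n_{1/2} constant from the rule
    r       : reactiveness parameter
    eps_rho : expected final allocation error
    delta   : standardized treatment effect
    All names are illustrative; only the formula comes from the text.
    """
    if n <= 0 or n_half <= 0:
        raise ValueError("n and n_half must be positive")
    if r + eps_rho < 0:
        # Assumption: a negative base with a non-integer exponent is not meaningful here.
        raise ValueError("r + eps_rho must be non-negative")

    budget = n * n_half / (n + n_half)   # saturating term, always below min(n, n_half)
    scale = (r + eps_rho) ** delta       # design-risk and difficulty adjustment
    return max(2, math.floor(0.5 * budget * scale))


# Hypothetical inputs, chosen only so the arithmetic is easy to follow:
# budget = 500, scale = 0.5, so 0.5 * 500 * 0.5 = 125.
print(burn_in_length(n=1000, n_half=1000, r=0.5, eps_rho=0.0, delta=1.0))  # 125

# A tiny trial with a small scale factor falls back to the floor of 2.
print(burn_in_length(n=10, n_half=1000, r=0.1, eps_rho=0.0, delta=2.0))    # 2
```

The term $\frac{n \cdot n_{1/2}}{n+n_{1/2}}$ is close to $n$ when $n \ll n_{1/2}$ and approaches $n_{1/2}$ as $n$ grows, so the burn-in scales with small trials but levels off for large ones. The `max` with 2 guarantees a non-trivial burn-in even when the computed value rounds down to zero.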

Abstract

Response-Adaptive Randomization (RAR) is recognized for its potential to deliver improvements in patient benefit. However, the utility of RAR is contingent on regularization methods to mitigate early instability and preserve statistical integrity. A standard regularization approach is the "burn-in" period, an initial phase of equal randomization before treatment allocation adapts based on accrued data. The length of this burn-in is a critical design parameter, yet its selection remains unsystematic and improvised, as no established guideline exists. A poorly chosen length poses significant risks: one that is too short leads to high estimation bias and type-I error rate inflation, while one that is too long impedes the intended patient and power benefits of using adaptation. The challenge of selecting the burn-in generalizes to a fundamental question: what is the statistically appropriate timing for the first adaptation? We introduce the first systematic framework for determining burn-in length. This framework synthesizes core factors - total sample size, problem difficulty, and two novel metrics (reactivity and expected final allocation error) - into a single, principled formula. Simulation studies, grounded in real-world designs, demonstrate that lengths derived from our formula successfully stabilize the trial. The formula identifies a "sweet spot" that mitigates type-I error rate inflation and mean-squared error, preserving the advantages of higher power and patient benefit. This framework moves researchers from conjecture toward a systematic, reliable approach.

Paper Structure

This paper contains 26 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Heatmap of the standardized treatment effect ($\delta$) across the full parameter space. The dashed red contour marks $\delta=1$. The color scale is capped at 2 for visualization, as $\delta$ approaches infinity in the top-left ($p_0 \to 0, p_1 \to 1$) and bottom-right ($p_0 \to 1, p_1 \to 0$) corners.
  • Figure 2: Total available burn-in proportion (largest $2b$ possible) based on sample size ($n$) and $\frac{n \cdot n_{1/2}}{n+n_{1/2}}$ with $n_{1/2}=1000$. The gray dashed line marks the upper bound on the burn-in budget ($2b=n$).
  • Figure 3: Two sampled paths (1000 participants) for the $R_0$ design under $p_0=0.6$, $p_1=0.8$, leading to $\rho=0.64$. The solid path trends more slowly than the dashed path and has not yet converged, ending at a value higher than $\rho$; hence, its estimated geometric slope is $0.23$, while its error $\epsilon^{1}_{\rho}$ is around 0.03. The dashed path has converged to a value around $\rho$ and hence has a higher slope, $\hat{c}_2(\rho)=0.37$, than the solid path, and its error is $\epsilon^{2}_{\rho}=0.00$ because the final proportion is below $\rho$.