Causal bandits with backdoor adjustment on unknown Gaussian DAGs
Yijia Zhao, Qing Zhou
TL;DR
This work addresses causal bandits where the underlying Gaussian DAG is unknown, and arms correspond to atomic interventions on variables affecting a reward. It introduces BA-UCB, a bandit algorithm that identifies backdoor adjustment sets from both observational and experimental data and combines their estimates to form confidence bounds without needing full graph recovery. The authors establish finite-sample regret bounds, showing reduced dependence on the number of arms when a large amount of prior observational data is available, and demonstrate empirical advantages in regret and computation time over baselines. The approach offers a practical path for efficient causal decision-making under graph uncertainty, with potential extensions to non-Gaussian settings and unobserved confounding.
Abstract
The causal bandit problem aims to sequentially learn the intervention that maximizes the expectation of a reward variable within a system governed by a causal graph. Most existing approaches either assume prior knowledge of the graph structure or impose unrealistically restrictive conditions on the graph. In this paper, we assume a Gaussian linear directed acyclic graph (DAG) over arms and the reward variable, and study the causal bandit problem when the graph structure is unknown. We identify backdoor adjustment sets for each arm using the experimental and observational data generated sequentially during the decision process, which allows us to estimate causal effects and construct upper confidence bounds. By integrating estimates from both data sources, we develop a novel bandit algorithm, based on modified upper confidence bounds, to sequentially determine the optimal intervention. We establish both case-dependent and case-independent upper bounds on the cumulative regret of our algorithm, which improve upon the bounds of standard multi-armed bandit algorithms. Our empirical study demonstrates the algorithm's advantage over existing methods with respect to cumulative regret and computation time.
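To illustrate the backdoor adjustment idea the abstract relies on, here is a minimal sketch in Python. It is not the authors' algorithm: it simulates a hypothetical three-variable linear Gaussian DAG (Z -> X, Z -> Y, X -> Y) and estimates the causal effect of X on the reward Y by regressing Y on X together with the adjustment set {Z}. The function name `backdoor_effect`, the edge weights, and the sample size are all assumptions chosen for illustration.

```python
import numpy as np

def backdoor_effect(x, y, z):
    """Coefficient of x in the least-squares regression of y on (1, x, z).

    In a linear Gaussian DAG, if the columns of z form a valid backdoor
    adjustment set for (x, y), this coefficient estimates the causal
    effect of x on y.
    """
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x, z])  # intercept, arm, adjusters
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Hypothetical toy DAG: Z -> X, Z -> Y, X -> Y with true causal effect 1.5.
rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)

# Adjusting for Z blocks the backdoor path X <- Z -> Y.
adjusted = backdoor_effect(x, y, z.reshape(-1, 1))

# An empty adjustment set leaves the confounding path open.
naive = backdoor_effect(x, y, np.empty((n, 0)))

print(f"adjusted estimate: {adjusted:.3f}")
print(f"naive estimate:    {naive:.3f}")
```

In this toy model the adjusted estimate should land close to the true effect of 1.5, while the unadjusted regression is biased upward by the confounder Z. A bandit algorithm of the kind described above would additionally have to select the adjustment set from data and attach a confidence width to each estimate, which this sketch does not attempt.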
