Causal bandits with backdoor adjustment on unknown Gaussian DAGs

Yijia Zhao, Qing Zhou

TL;DR

This work addresses causal bandits where the underlying Gaussian DAG is unknown, and arms correspond to atomic interventions on variables affecting a reward. It introduces BA-UCB, a bandit algorithm that identifies backdoor adjustment sets from both observational and experimental data and combines their estimates to form confidence bounds without needing full graph recovery. The authors establish finite-sample regret bounds, showing reduced dependence on the number of arms when a large amount of prior observational data is available, and demonstrate empirical advantages in regret and computation time over baselines. The approach offers a practical path for efficient causal decision-making under graph uncertainty, with potential extensions to non-Gaussian settings and unobserved confounding.

Abstract

The causal bandit problem aims to sequentially learn the intervention that maximizes the expectation of a reward variable within a system governed by a causal graph. Most existing approaches assume prior knowledge of the graph structure, or impose unrealistically restrictive conditions on the graph. In this paper, we assume a Gaussian linear directed acyclic graph (DAG) over arms and the reward variable, and study the causal bandit problem when the graph structure is unknown. We identify backdoor adjustment sets for each arm using sequentially generated experimental and observational data during the decision process, which allows us to estimate causal effects and construct upper confidence bounds. By integrating estimates from both data sources, we develop a novel bandit algorithm, based on modified upper confidence bounds, to sequentially determine the optimal intervention. We establish both case-dependent and case-independent upper bounds on the cumulative regret for our algorithm, which improve upon the bounds of the standard multi-armed bandit algorithms. Our empirical study demonstrates its advantage over existing methods with respect to cumulative regret and computation time.
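The backdoor adjustment idea mentioned above can be illustrated on a toy linear Gaussian DAG. The sketch below is not the paper's BA-UCB algorithm: the three-node graph (Z -> X, Z -> Y, X -> Y), the edge coefficients, and the helper names `simulate` and `ols_coef` are all assumptions chosen for illustration, and the adjustment set {Z} is taken as known rather than learned from data. It only shows why regressing the reward on an arm together with a valid adjustment set recovers the interventional effect from observational data, while the unadjusted regression does not.

```python
import numpy as np

def simulate(n, rng, do_x=None):
    """Draw n samples from a hypothetical linear Gaussian DAG: Z -> X, Z -> Y, X -> Y."""
    z = rng.normal(size=n)
    if do_x is None:
        x = 0.8 * z + rng.normal(size=n)   # observational regime
    else:
        x = np.full(n, float(do_x))        # atomic intervention do(X = do_x)
    y = 1.5 * x + 2.0 * z + rng.normal(size=n)
    return z, x, y

def ols_coef(y, *regressors):
    """Least-squares slopes of y on the regressors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(0)
z, x, y = simulate(50_000, rng)

naive = ols_coef(y, x)[0]        # biased: ignores the confounder Z
adjusted = ols_coef(y, x, z)[0]  # backdoor-adjusted with the set {Z}

# Experimental check: mean reward under do(X = 1) in the same toy model.
_, _, y_do = simulate(50_000, rng, do_x=1.0)
interventional_mean = y_do.mean()

print(f"naive slope:     {naive:.3f}")
print(f"adjusted slope:  {adjusted:.3f}")
print(f"E[Y | do(X=1)]:  {interventional_mean:.3f}")
```

In this toy model the true causal effect of X on Y is 1.5, so the adjusted slope and the experimental mean under do(X = 1) should both land near 1.5, while the naive slope is inflated by the open backdoor path through Z.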

Paper Structure

This paper contains 22 sections, 5 theorems, 103 equations, 5 figures, 3 algorithms.

Key Result

Theorem 1

Under Assumption ass:id and Assumption ass:n0, the cumulative regret of Algorithm alg:BA-UCB, with parameters $c\geq 4\sqrt{2}\widetilde{\phi}$ and $c_3\geq \max\left\{64,32\psi^2/\delta^2\right\}$, after $T$ rounds is at most where $C_3$ is a constant that does not depend on $T$.

Figures (5)

  • Figure 1: Comparison of empirical cumulative regrets over 5000 time steps for BA-UCB, BBB-UCB, CN-UCB, and UCB algorithms when $p=10$.
  • Figure 2: Comparison of empirical cumulative regrets over 5000 time steps for BA-UCB, CN-UCB, and UCB algorithms under the settings of $p=20,30,50$. The top panel reports the cases where the optimal arm is not a parent of the reward variable and the lower panel reports the cases where the optimal arm is a parent of the reward variable.
  • Figure 3: Ranges of empirical cumulative regrets between the 5% and 95% quantiles at $T=1000, 3000, 5000$ across 100 simulated Gaussian DAGs for the BA-UCB, BBB-UCB, CN-UCB, and UCB algorithms when $p=10$.
  • Figure 4: Boxplots of cumulative regrets after $T=5000$ rounds for the BA-UCB, CN-UCB, and UCB algorithms when $p=20, 30, 50$. Left: the optimal arm is a parent of the reward. Right: the optimal arm is not a parent of the reward.
  • Figure 5: Comparison of empirical cumulative regrets of BA-UCB algorithms with estimates generated from weighted sum and one linear regression over time when $p=10, 20, 30$.

Theorems & Definitions (7)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: Case-dependent bound
  • Theorem 5: Case-independent bound