Learning Optimal and Sample-Efficient Decision Policies with Guarantees

Daqian Shao

TL;DR

A sample-efficient algorithm for solving conditional moment restriction (CMR) problems is derived, with convergence and optimality guarantees, and a provably optimal learning algorithm for linear temporal logic (LTL) objectives is developed that improves sample efficiency over existing methods.

Abstract

The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restrictions (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis in real-world decision making.
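To make the role of instrumental variables concrete, the sketch below simulates a linear treatment-outcome relationship with a hidden confounder and compares a naive regression with a classical ratio (Wald) IV estimate. This is not the estimator derived in the thesis: the data-generating process, the function names (`simulate`, `ols_slope`, `iv_slope`) and all coefficients are assumptions chosen for illustration. In this linear setting the IV estimate solves the sample version of the moment condition $\mathbb{E}[(Y - \theta A)\,Z] = 0$, a simple special case of a conditional moment restriction.

```python
import numpy as np


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical synthetic data: U confounds both treatment A and outcome Y."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # hidden confounder (never observed by the learner)
    z = rng.normal(size=n)  # instrument: affects Y only through A
    a = z + u + 0.1 * rng.normal(size=n)
    y = true_effect * a + 3.0 * u + 0.1 * rng.normal(size=n)
    return z, a, y


def ols_slope(x, y):
    """Naive regression slope of y on x; biased when x is confounded."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


def iv_slope(z, a, y):
    """Ratio (Wald) IV estimate: cov(Z, Y) / cov(Z, A)."""
    z_c = z - z.mean()
    return float(z_c @ (y - y.mean()) / (z_c @ (a - a.mean())))


if __name__ == "__main__":
    z, a, y = simulate()
    print("naive OLS estimate:", ols_slope(a, y))
    print("IV estimate:       ", iv_slope(z, a, y))
```

Under these assumptions the naive slope absorbs the confounder's contribution and overshoots the true effect of 2.0, while the IV estimate stays close to it because the instrument is independent of the confounder.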

Paper Structure (146 sections, 20 theorems, 177 equations, 18 figures, 10 tables, 4 algorithms)


Figures (18)

  • Figure 1: An overview of the types of problems considered in policy learning, categorised by the learning objective and the data source from which the policy learns. Problems considered in this thesis are highlighted with the corresponding chapters.
  • Figure 2: The causal graph of outcome $Y$, treatment $A$ and hidden confounder $U$.
  • Figure 3: The causal graph of outcome $Y$, treatment $A$, hidden confounder $U$ and an instrumental variable $Z$.
  • Figure 4: The causal graph of outcome $Y$, treatment $A$, hidden confounder $U$ and proxies $V$ and $W$.
  • Figure 5: The causal graph of the contextual IV setting, where $R=f_r(C,A)+\epsilon$ and $Z$ is an instrumental variable that affects $R$ only through $A$.
  • ...and 13 more figures

Theorems & Definitions (58)

  • Theorem: 2
  • Lemma: 2
  • Proposition: 2
  • Corollary: 2
  • Definition: Markov decision processes [Puterman1994]
  • Definition: MDP with atomic proposition labels
  • Definition: Structural Causal Model
  • Definition: Conditional Moment Restriction
  • Definition: Instrumental Variable
  • Proposition: [Miao2018]
  • ...and 48 more