Learning to Make Adherence-Aware Advice

Guanting Chen; Xiaocheng Li; Chunlin Sun; Hanzhao Wang

TL;DR

This work develops a theory and algorithms for adherence-aware AI advice in sequential decision-making, modeling human adherence with $\theta(s,a)$ and a defer option for the machine. It introduces a human–machine MDP framework with two learning environments: $\mathcal{E}_1$ (partially known) and $\mathcal{E}_2$ (fully unknown), and provides two tailored learning approaches. The first is a UCB-based method (UCB-ADherence) achieving a PAC bound of $O(H^2 S^2 A / \epsilon^2)$ in $\mathcal{E}_1$, leveraging the monotonicity of optimal value in $\theta$. The second is a reward-free exploration approach (RFE-$\beta$) for $\mathcal{E}_2$ that attains near-optimal policies across all $\beta>0$ with $O(H^5 S A / \epsilon^2)$ episodes and connects to CMDP formulations for pertinent advice. Empirical results on Flappy Bird and Car Driving show superior sample efficiency and practical effectiveness of the specialized algorithms over generic RL baselines, highlighting the value of incorporating human adherence and selective advising into learning for human–AI collaboration.

Abstract

As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge is that AI policies become suboptimal when they fail to account for humans disregarding AI recommendations; another is the need for AI to provide advice selectively, when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human's adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from giving advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.
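To make the model in the abstract concrete, the following is a minimal sketch of planning in a known, tabular version of the human-machine problem. It is not the paper's algorithm. It assumes that when the human ignores advice (or the machine defers), the human acts according to a default policy `pi_H`; that fallback rule, the function name `optimal_advice_values`, and the randomly generated environment are all illustrative assumptions. The final lines check numerically the monotonicity in $\theta$ mentioned in the TL;DR: raising the adherence level should not lower the optimal value.

```python
import numpy as np

def optimal_advice_values(P, R, pi_H, theta, H):
    """Backward induction over the machine's choices: advise an action or defer.

    P:     (S, A, S) transition probabilities
    R:     (S, A) rewards
    pi_H:  (S, A) human default policy (assumed fallback when advice is ignored or absent)
    theta: (S, A) adherence level, probability the human follows advice a in state s
    H:     horizon
    """
    S, A = R.shape
    V = np.zeros(S)                         # value after the final step
    policy = np.zeros((H, S), dtype=int)    # index A encodes "defer"
    for h in reversed(range(H)):
        Q_env = R + P @ V                   # (S, A): value if the human actually plays a
        q_defer = (pi_H * Q_env).sum(axis=1)  # no advice: the human acts alone
        # Advising a: followed with probability theta, otherwise fall back to pi_H.
        q_advise = theta * Q_env + (1.0 - theta) * q_defer[:, None]
        Q = np.concatenate([q_advise, q_defer[:, None]], axis=1)  # (S, A + 1)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy

# Hypothetical random environment, used only for illustration.
rng = np.random.default_rng(0)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S), rows sum to 1
R = rng.uniform(size=(S, A))
pi_H = rng.dirichlet(np.ones(A), size=S)    # shape (S, A)

V_zero, _ = optimal_advice_values(P, R, pi_H, np.zeros((S, A)), H)
V_low, _ = optimal_advice_values(P, R, pi_H, np.full((S, A), 0.3), H)
V_high, policy_high = optimal_advice_values(P, R, pi_H, np.full((S, A), 0.9), H)
V_one, _ = optimal_advice_values(P, R, pi_H, np.ones((S, A)), H)

# Under these assumptions, higher adherence never reduces the optimal value.
print(np.all(V_high >= V_low - 1e-12))
```

With $\theta = 0$ the advice is never followed, so the value reduces to that of the human's default policy; with $\theta = 1$ the problem reduces to an ordinary MDP. The learning algorithms in the paper address the harder setting where some or all of these quantities are unknown.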

Paper Structure (33 sections, 10 theorems, 99 equations, 7 figures, 3 algorithms)

Introduction
Related Work
Model Setup
The learning problem
Main Results
Algorithms and Analyses
UCB-based algorithm for $\mathcal{E}_1$
Reward-free exploration algorithm for $\mathcal{E}_2$
Numerical Experiment
Proofs
Proofs for Section \ref{sec_model}
Supplementary Materials for Algorithm \ref{alg:alg1}
Notations of Algorithm \ref{alg:alg1}
Algorithm Analysis
Higher updating frequency.
...and 18 more sections

Key Result

Proposition 1

For all $s\in\mathcal{S}$ and $h\in[H]$ such that $\pi^*_{h,\beta}(s)\neq\text{defer}$, we have

Figures (7)

Figure 1: Flappy Bird environment: the player needs to navigate the bird to avoid walls and collect stars.
Figure 2: The regrets for learning the optimal advice for Policy Greedy and Policy Safe. Figures \ref{fig:regret_bird_greedy} and \ref{fig:regret_bird_safe} show the regrets of RFE-AD, UCB-AD, and EULER for the two policies, respectively. Figure \ref{fig:regret_UCB_AD_thetas} shows the regrets of UCB-AD for the two policies under different $\theta$'s.
Figure 3: The performance of making pertinent advice. The value gap is defined as the difference between the value of the current policy and the optimal value, with the red dashed line as the benchmark for $0$ loss of the policy. Figure \ref{fig:cvg_beta} shows the convergence of RFE-$\beta$ under different $\beta$'s. Figure \ref{fig:cvg_CMDP_FR} compares the convergence of RFE-CMDP and UC-CFH. Figure \ref{fig:value_post_CMDP} evaluates the performance of the policy learned from the learning episodes in Figure \ref{fig:cvg_CMDP_FR}.
Figure 4: Typical trajectories of the two policy types. Red means the machine defers and green means the machine advises.
Figure 5: Value Gaps of RFE-$\beta$.
...and 2 more figures

Theorems & Definitions (18)

Proposition 1
Proposition 2: Monotonicity property
Theorem 1
Theorem 2
Corollary 1
proof : Proof of Proposition \ref{prop_monotone}
proof : Proof of Proposition \ref{prop_gap}
Lemma 1
Lemma 2
proof : Proof of Theorem \ref{thm:UCB-AD}
...and 8 more
