"Trust me on this" Explaining Agent Behavior to a Human Terminator
Uri Menkes, Assaf Hallak, Ofra Amir
TL;DR
The paper addresses how to support effective human intervention in human-AI teams by explaining an autonomous agent's policy within a termination framework (TerMDP). It proposes a policy-summarization approach that selects informative trajectories using multiple scoring schemes, including uncertainty reduction, Q-value reconstruction fidelity, and policy likelihood, and uses a genetic algorithm to optimize the summaries. Through simulations in Highway-Env and human-subject studies, the authors show that higher-scoring summaries improve humans' ability to decide when to take over, which suggests improved team performance. This work advances explainable RL in the context of human-in-the-loop termination, offering practical methods for improving trust, intervention timing, and collaborative efficacy in safety-critical domains.
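The summary-selection idea can be illustrated with a minimal sketch, assuming each candidate trajectory already has a scalar score. The function names, the additive objective, and the genetic-algorithm settings below are hypothetical placeholders for illustration and are not taken from the paper, whose scoring schemes (uncertainty reduction, Q-value reconstruction fidelity, policy likelihood) are more involved.

```python
import random


def summary_score(summary, trajectory_scores):
    """Toy objective: sum of per-trajectory scores (an assumed stand-in, not the paper's scheme)."""
    return sum(trajectory_scores[i] for i in summary)


def select_summary(trajectory_scores, k, generations=50, pop_size=20, seed=0):
    """Pick k distinct trajectory indices with a simple genetic algorithm (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(len(trajectory_scores)))
    # Each individual is a sorted tuple of k distinct trajectory indices.
    population = [tuple(sorted(rng.sample(indices, k))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: summary_score(s, trajectory_scores), reverse=True)
        parents = population[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))  # crossover: draw from both parents' indices
            child = set(rng.sample(pool, k))
            if rng.random() < 0.3:  # mutation: swap one index for an unused one
                unused = [i for i in indices if i not in child]
                if unused:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(unused))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=lambda s: summary_score(s, trajectory_scores))


if __name__ == "__main__":
    # Made-up scores for six candidate trajectories; choose a three-trajectory summary.
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
    best = select_summary(scores, k=3)
    print(best, summary_score(best, scores))
```

Because the better half of each generation is carried over unchanged, the best summary found never gets worse between generations. A real objective would score the summary as a whole rather than summing independent per-trajectory values.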
Abstract
Consider a setting where a pre-trained agent is operating in an environment and a human operator can decide to temporarily terminate its operation and take over for some duration of time. These kinds of scenarios are common in human-machine interactions, for example in autonomous driving, factory automation, and healthcare. In these settings, we typically observe a trade-off between two extreme cases: if no take-overs are allowed, then the agent might employ a sub-optimal, possibly dangerous policy. Alternatively, if there are too many take-overs, then the human has no confidence in the agent, greatly limiting its usefulness. In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.
