"Trust me on this" Explaining Agent Behavior to a Human Terminator

Uri Menkes, Assaf Hallak, Ofra Amir

TL;DR

The paper addresses how to support effective human intervention in human-AI teams by explaining an autonomous agent's policy within a termination framework (TerMDP). It proposes a policy-summarization approach that selects informative trajectories using multiple scoring schemes (uncertainty reduction, Q-value reconstruction fidelity, and policy likelihood) and uses a genetic algorithm to optimize the summaries. Through simulations in Highway-Env and human-subject studies, the authors show that higher-scoring summaries improve humans' ability to decide when to take over, implying enhanced team performance. This work advances explainable RL in the context of human-in-the-loop termination, offering practical methods for improving trust, intervention timing, and collaborative efficacy in safety-critical domains.

Abstract

Consider a setting where a pre-trained agent is operating in an environment and a human operator can decide to temporarily terminate its operation and take over for some duration of time. These kinds of scenarios are common in human-machine interactions, for example in autonomous driving, factory automation, and healthcare. In these settings, we typically observe a trade-off between two extreme cases: if no take-overs are allowed, then the agent might employ a sub-optimal, possibly dangerous policy. Alternatively, if there are too many take-overs, then the human has no confidence in the agent, greatly limiting its usefulness. In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Top. To explain the agent's behavior to the human, we use a summarizer module to select a subset of trajectories from a large set of simulated data. The selected subset is rendered and shown to the human to improve her understanding of the agent. Bottom. During online interaction, the human can decide at any point to terminate the agent and take over the policy for $h$ steps.
  • Figure 2: The genetic algorithm optimizes for score \ref{s3} and is tested across different summary lengths. The score demonstrates improvement as the generation size increases. Similar convergence trends were observed for scores \ref{s1} and \ref{s2}.
  • Figure 3: Results of the termination task study. Participants who viewed the higher score summaries outperformed participants who viewed the lower score summaries ($n=160$, $p$-value = 0.02).
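
The summarizer described in Figures 1 and 2 picks a fixed-size subset of trajectories that maximizes a score, using a genetic algorithm. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The function names, the hyperparameters (`population_size`, `elite`, `mutation_rate`), and the toy coverage score are all assumptions made for illustration. In particular, `toy_score` is a stand-in and is not one of the paper's three scores.

```python
import random
from typing import Callable, List, Tuple

Summary = Tuple[int, ...]  # sorted indices of the trajectories in a summary


def random_summary(n_trajectories: int, k: int, rng: random.Random) -> Summary:
    """Draw k distinct trajectory indices uniformly at random."""
    return tuple(sorted(rng.sample(range(n_trajectories), k)))


def crossover(a: Summary, b: Summary, k: int, rng: random.Random) -> Summary:
    """Build a child by sampling k indices from the union of two parents."""
    pool = sorted(set(a) | set(b))  # always has at least k elements
    return tuple(sorted(rng.sample(pool, k)))


def mutate(s: Summary, n_trajectories: int, rate: float, rng: random.Random) -> Summary:
    """With probability `rate`, swap each index for one not yet in the summary."""
    chosen = set(s)
    for idx in s:
        if rng.random() < rate:
            candidates = [i for i in range(n_trajectories) if i not in chosen]
            if candidates:
                chosen.remove(idx)
                chosen.add(rng.choice(candidates))
    return tuple(sorted(chosen))


def genetic_summary_search(
    score: Callable[[Summary], float],
    n_trajectories: int,
    k: int,
    population_size: int = 30,
    generations: int = 40,
    elite: int = 5,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> Tuple[Summary, List[float]]:
    """Return the best summary found and the best score seen in each generation."""
    rng = random.Random(seed)
    population = [random_summary(n_trajectories, k, rng) for _ in range(population_size)]
    best_per_generation: List[float] = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        best_per_generation.append(score(population[0]))
        parents = population[:elite]  # elitism: top summaries survive unchanged
        children = []
        while len(children) < population_size - elite:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, k, rng)
            children.append(mutate(child, n_trajectories, mutation_rate, rng))
        population = parents + children
    return max(population, key=score), best_per_generation


# Toy data (hypothetical): each of 40 trajectories "covers" 5 of 50 abstract states.
_data_rng = random.Random(1)
coverage = [set(_data_rng.sample(range(50), 5)) for _ in range(40)]


def toy_score(summary: Summary) -> float:
    """Stand-in score: number of distinct states covered by the summary."""
    return float(len(set().union(*(coverage[i] for i in summary))))


best_summary, history = genetic_summary_search(toy_score, n_trajectories=40, k=4)
```

Because the top `elite` summaries are carried over unchanged, the best score per generation in `history` never decreases. Swapping `toy_score` for a different scoring function leaves the search loop untouched, which is one plausible reason to keep the score and the optimizer separate.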