BXRL: Behavior-Explainable Reinforcement Learning

Ram Rachum; Yotam Amitai; Yonatan Nakar; Reuth Mirsky; Cameron Allen

Abstract

A major challenge of Reinforcement Learning is that agents often learn undesired behaviors that seem to defy the reward structure they were given. Explainable Reinforcement Learning (XRL) methods can answer queries such as "explain this specific action", "explain this specific trajectory", and "explain the entire policy". However, XRL lacks a formal definition for behavior as a pattern of actions across many episodes. We provide such a definition, and use it to enable a new query: "explain this behavior". We present Behavior-Explainable Reinforcement Learning (BXRL), a new problem formulation that treats behaviors as first-class objects. BXRL defines a behavior measure as any function $m : \Pi \to \mathbb{R}$, allowing users to precisely express the pattern of actions that they find interesting and measure how strongly the policy exhibits it. We define contrastive behaviors that reduce the question "why does the agent prefer $a$ to $a'$?" to "why is $m(\pi)$ high?", which can be explored with differentiation. We do not implement an explainability method; we instead analyze three existing methods and propose how they could be adapted to explain behavior. We present a port of the HighwayEnv driving environment to JAX, which provides an interface for defining, measuring, and differentiating behaviors with respect to the model parameters.
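To make the idea of a differentiable behavior measure concrete, the following is a minimal sketch in JAX. It does not use the HighJax interface, whose API is not described here. Everything in it is an assumption made for illustration: the linear-softmax policy, the function names `policy_logits` and `behavior_measure`, the random observations, and the choice of "mean probability of the selected action over a set of scenarios" as the measure. It only shows that a scalar $m(\pi_\theta)$ built from a parameterized policy can be differentiated with respect to $\theta$.

```python
import jax
import jax.numpy as jnp

N_FEATURES = 4  # hypothetical observation size
N_ACTIONS = 3   # hypothetical action set, e.g. left, right, faster


def policy_logits(params, obs):
    """Toy linear policy (an assumption, not the paper's model)."""
    return obs @ params["w"] + params["b"]


def behavior_measure(params, scenario_obs, scenario_actions):
    """Illustrative m(pi): mean probability of the chosen action per scenario."""
    logits = policy_logits(params, scenario_obs)      # (n_scenarios, N_ACTIONS)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    picked = jnp.take_along_axis(
        log_probs, scenario_actions[:, None], axis=-1
    )[:, 0]                                           # (n_scenarios,)
    return jnp.mean(jnp.exp(picked))


key = jax.random.PRNGKey(0)
k_w, k_obs = jax.random.split(key)
params = {
    "w": 0.1 * jax.random.normal(k_w, (N_FEATURES, N_ACTIONS)),
    "b": jnp.zeros(N_ACTIONS),
}

# Six made-up scenarios, two per action; real scenarios would come from rollouts.
scenario_obs = jax.random.normal(k_obs, (6, N_FEATURES))
scenario_actions = jnp.array([0, 0, 1, 1, 2, 2])

# Value of the measure and its gradient with respect to the policy parameters.
m_value, m_grad = jax.value_and_grad(behavior_measure)(
    params, scenario_obs, scenario_actions
)
print("m(pi) =", float(m_value))
print("gradient shapes:", jax.tree_util.tree_map(lambda g: g.shape, m_grad))
```

In this sketch the gradient `m_grad` has the same structure as `params`, so each entry indicates how a small change in that parameter would raise or lower the measured behavior. How such gradients are turned into an explanation is left to the explainability method.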

Paper Structure (38 sections, 7 equations, 3 figures, 4 tables)

Introduction
Background
Explanation target and source
Contrastive explanation
Behavior as disposition
Markov Decision Process (MDP)
Attribution and importance methods
Properties of Behavior
Related Work
[rishav2025behavior]
HIGHLIGHTS [amir2018highlights]
ASQ-IT [amitai2024asqit]
Behavior-Explainable Reinforcement Learning (BXRL)
Expressing a Behavior as a Number
Adapting Existing XRL Methods for BXRL
...and 23 more sections

Figures (3)

Figure 1: HighJax TUI for defining behavior scenarios by selecting states and actions from existing rollouts.
Figure 2: Left: Return breakdown of a PPO agent training on HighJax. Training is deliberately slowed down to $D_{\mathrm{KL}} = 5 \times 10^{-4}$ per epoch. Right: Value of the collision behavior measure $m_c$ across training.
Figure 3: Six scenarios defining the collision behavior measure $m_c$. Each causes the ego car (dark gray) to crash within two timesteps. Columns correspond to the action taken: left, right, or faster.

Theorems & Definitions (2)

Definition 1: Behavior measure
Definition 2: BXRL Method
