Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel, Wolfgang Stammer, Kristian Kersting

TL;DR

*Successive Concept Bottleneck Agents* (SCoBots) are introduced; they integrate consecutive concept bottleneck (CB) layers and enable us to identify and resolve a previously unknown misalignment problem in the iconic video game Pong.

Abstract

Goal misalignment, reward sparsity, and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks prevents domain experts from inspecting the model and revising suboptimal policies. To this end, we introduce *Successive Concept Bottleneck Agents* (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots represent concepts not just as properties of individual objects but also as relations between objects, which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots' competitive performance, and also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game Pong and to resolve it. Overall, SCoBots thus result in more human-aligned RL agents. Our code is available at https://github.com/k4ntz/SCoBots .

Paper Structure

This paper contains 23 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Successive Concept Bottleneck Agents (SCoBots) allow for easy inspection and revision. Top: Deep RL agents trained on Pong produce high playing scores with importance map explanations that suggest sensible underlying reasons for taking an action. However, when the enemy is hidden, the deep RL agent fails to even catch the ball, for no clear reason. Bottom: SCoBots, on the other hand, allow for multi-level inspection of the reasoning behind the action selection, e.g., at the relational concept level but also at the action level. Moreover, they allow users to easily intervene on them to prevent the agents from focusing on potentially misleading concepts. In this way, SCoBots can mitigate RL-specific caveats like goal misalignment.
  • Figure 2: An overview of Successive Concept Bottleneck Agents (SCoBots). SCoBots decompose the policy into consecutive interpretable concept bottlenecks (ICB). Objects and their properties are first extracted from the raw input; human-understandable functions are then employed to derive relational concepts, which are used to select an action. The understandable concepts enable interactivity. Each bottleneck allows expert users to, e.g., prune concepts or use them to define additional reward signals.
  • Figure 3: Object-centric agents can master different Atari environments and interactive SCoBots allow for corrections. Human-normalized scores of different agents trained using PPO on $9$ ALE environments, including deep agents (i.e., using CNNs), guided decision tree policy (SCoBots), their neural object-centric baseline (NN-), and these baselines without guidance (NG). SCoBots obtain similar or better scores than the deep agents, showing that object-centric agents can also solve RL tasks while making use of human-understandable concepts (left). Guiding SCoBots makes it possible to correct misalignment in Pong (center) and to obtain the originally intended agents, depicted by a level completion score of 100% on the intended goal's evaluation in Kangaroo (right).
  • Figure 4: Interpretable SCoBots allow users to follow their decision process, thanks to their interpretable concepts. The states and associated decision processes of SCoBots (extracted from the decision trees) on Skiing (left), and from unguided SCoBots on Pong (middle) and Kangaroo (right). For example, in this Skiing state, our SCoBot selects RIGHT, as the signed distance between Player and the (left) Flag1 (on the $x$ axis) is greater than $15$. This agent selects the correct action for the right reasons.
  • Figure 5: SCoBots can learn with noisy object detectors; transparent SCoBots rely on relations. Final human-normalized scores (with stds) comparing SCoBots and the object-centric neural baselines (NN-SCoBots), with and without relations. We also provide the scores of NN-SCoBots that learned on noisy environments. The noise only noticeably affects the agents on Kangaroo. Ablating the relations is harmless for NN-SCoBots, as neural networks can recompute them, but impacts SCoBots' performance on $6$ games. ($^{*}$For better visualization, we used a human score of $100$ in Boxing.)
  • ...and 8 more figures
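
The pipeline described in the Figure 2 caption (objects, then relational concepts, then an action) and the Skiing rule quoted in the Figure 4 caption can be illustrated with a minimal sketch. This is not the API of the SCoBots repository: every name below (`GameObject`, `extract_objects`, `relational_concepts`, `prune`, `select_action`), the coordinates, the sign convention for the signed distance (other minus reference), and the `NOOP` fallback are assumptions made only for illustration. Only the threshold of 15 and the `RIGHT` action come from the caption.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical concept key: (relation name, reference object, other object).
ConceptKey = Tuple[str, str, str]


@dataclass(frozen=True)
class GameObject:
    """An extracted object with its position (first bottleneck)."""
    name: str
    x: float
    y: float


def extract_objects(raw_state: Dict[str, Tuple[float, float]]) -> List[GameObject]:
    """Stand-in for an object extractor; here the 'raw input' is already symbolic."""
    return [GameObject(name, float(x), float(y)) for name, (x, y) in raw_state.items()]


def relational_concepts(objects: List[GameObject]) -> Dict[ConceptKey, float]:
    """Second bottleneck: signed per-axis distances for every ordered object pair.

    Assumed sign convention: other minus reference.
    """
    concepts: Dict[ConceptKey, float] = {}
    for ref in objects:
        for other in objects:
            if ref.name == other.name:
                continue
            concepts[("dist_x", ref.name, other.name)] = other.x - ref.x
            concepts[("dist_y", ref.name, other.name)] = other.y - ref.y
    return concepts


def prune(concepts: Dict[ConceptKey, float], banned: Set[str]) -> Dict[ConceptKey, float]:
    """Expert intervention: drop every concept that involves a banned object."""
    return {
        key: value
        for key, value in concepts.items()
        if key[1] not in banned and key[2] not in banned
    }


def select_action(concepts: Dict[ConceptKey, float], threshold: float = 15.0) -> str:
    """One-rule policy mirroring the Skiing example from the Figure 4 caption."""
    distance = concepts.get(("dist_x", "Player", "Flag1"))
    if distance is not None and distance > threshold:
        return "RIGHT"
    return "NOOP"  # hypothetical fallback, not taken from the paper


if __name__ == "__main__":
    state = {"Player": (40.0, 100.0), "Flag1": (70.0, 120.0)}  # made-up coordinates
    concepts = relational_concepts(extract_objects(state))
    print(select_action(concepts))
    print(select_action(prune(concepts, {"Flag1"})))
```

Because each stage exposes named concepts, an intervention such as `prune` changes the policy's inputs in a way a user can read off directly, which is the kind of interactivity the captions describe.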

Theorems & Definitions (1)

  • Definition A.1