Multi-intention Inverse Q-learning for Interpretable Behavior Representation

Hao Zhu; Brice De La Crompe; Gabriel Kalweit; Artur Schneider; Maria Kalweit; Ilka Diester; Joschka Boedecker

TL;DR

The paper addresses the recovery of discrete time-varying, multi-intention rewards in animal and human decision-making by introducing Hierarchical Inverse Q-Learning (HIQL). HIQL uses expectation-maximization (EM) to segment expert trajectories into $K$ latent intentions governed by Markov-switching dynamics ($\Pi$, $\Lambda$) and solves the IRL problem for each intention with an inner IQL solver, which permits model-free learning in unknown environments. On gridworld and real mouse datasets, HIQL outperforms the state-of-the-art DIRL, giving higher predictive accuracy and interpretable, step-like reward structures that reveal exploitation, exploration, and other strategies. The approach offers a scalable framework for neuroscience and cognitive science to link observed behavior to latent reward functions and brain mechanisms.

Abstract

In advancing the understanding of natural decision-making processes, inverse reinforcement learning (IRL) methods have proven instrumental in reconstructing animals' intentions underlying complex behaviors. Given the recent development of a continuous-time multi-intention IRL framework, there has been persistent inquiry into inferring discrete time-varying rewards with IRL. To address this challenge, we introduce the class of hierarchical inverse Q-learning (HIQL) algorithms. Through an unsupervised learning process, HIQL divides expert trajectories into multiple intention segments and solves the IRL problem independently for each. Applied to simulated experiments and several real animal behavior datasets, our approach outperforms current benchmarks in behavior prediction and produces interpretable reward functions. Our results suggest that the intention transition dynamics underlying complex decision-making behavior are better modeled by a step function than by a smoothly varying function. This advancement holds promise for neuroscience and cognitive science, contributing to a deeper understanding of decision-making and uncovering underlying brain mechanisms.
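The segmentation step can be illustrated with a standard forward-backward pass over a hidden Markov chain of intentions. This is a minimal sketch, not the authors' implementation: it assumes that $\Pi$ is the initial distribution over intentions, that $\Lambda$ is a row-stochastic intention transition matrix, and that the per-intention action likelihoods are already available from the current policies. The function name `intention_posteriors` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def intention_posteriors(
    init: Sequence[float],             # assumed role of Pi: initial intention distribution
    trans: Sequence[Sequence[float]],  # assumed role of Lambda: K x K row-stochastic matrix
    lik: Sequence[Sequence[float]],    # lik[t][k]: likelihood of the action at step t under intention k
) -> List[List[float]]:
    """Return gamma[t][k] = P(intention at step t is k | whole trajectory)."""
    T, K = len(lik), len(init)

    # Forward pass, normalised at every step to avoid numerical underflow.
    first = [init[k] * lik[0][k] for k in range(K)]
    total = sum(first)
    alpha = [[f / total for f in first]]
    for t in range(1, T):
        cur = [
            lik[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in range(K))
            for k in range(K)
        ]
        total = sum(cur)
        alpha.append([c / total for c in cur])

    # Backward pass, also rescaled; the scale cancels in the final normalisation.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [
            sum(trans[k][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(K))
            for k in range(K)
        ]
        total = sum(raw)
        beta[t] = [r / total for r in raw]

    # Smoothed posteriors are proportional to alpha * beta at each step.
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(K)]
        total = sum(g)
        gamma.append([x / total for x in g])
    return gamma


# Toy trajectory (invented numbers): the first two actions fit intention 0,
# the last two fit intention 1.
gamma = intention_posteriors(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    lik=[[0.8, 0.2], [0.8, 0.2], [0.2, 0.8], [0.2, 0.8]],
)
```

In an EM loop of this kind, posteriors such as `gamma` would weight each time step when the per-intention IRL problems are solved in the M-step; how HIQL does this exactly is specified in the paper's Theorem 4.3 and appendix, not here.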

Paper Structure (20 sections, 1 theorem, 19 equations, 8 figures, 3 algorithms)

Introduction
Related work
Background
Markov decision processes.
Inverse Q-learning.
Hierarchical inverse Q-learning
Experiments and discussion
Gridworld benchmark
Real-world mice navigation benchmark
Application to mice reversal-learning behavior
Conclusion
Theoretical and technical details
Proof of Theorem 4.3
Computing required posterior probabilities
Algorithms
...and 5 more sections

Key Result

Theorem 4.3

Solving problem (prob:hiql) is equivalent to solving a sequence of optimization problems: and

Figures (8)

Figure 1: Graphical representation of expert's decision process.
Figure 2: The gridworld environment.
Figure 3: Results for the gridworld benchmark. (a) Comparison of HIAVI and DIRL on datasets with different numbers of expert trajectories, represented as log-likelihood on the test dataset (mean $\pm$ standard error, 5-fold cross-validation). (b) Predicted intention dynamics from HIAVI and DIRL using the outputs from the best cross-validation fold, represented as the posterior probability of the 'abandon' intention and averaged across all trajectories (mean $\pm$ standard error across $1024$ trajectories). (c) Visualization of the ground-truth and learnt state-value functions from the best cross-validation fold (top), and the corresponding EVDs (bottom, mean $\pm$ standard error, 5-fold cross-validation) from HIAVI and DIRL.
Figure 4: The labyrinth environment.
Figure 5: Results for the navigation benchmark of the water-restricted cohort. (a) Comparison of HIAVI, DIRL, and a random policy, represented as log-likelihood on the test dataset. (b) BIC as a function of the number of intentions in HIAVI. (c) Learnt policy (red arrows and crosses) in the environment and corresponding state occupancy (grey colormap) under different intentions. (d) Predicted intention dynamics from HIAVI, averaged across all $200$ trajectories. Solid and shaded curves denote the mean and standard error across $200$ trajectories. (e) Inferred intention transition matrix from HIAVI.
...and 3 more figures
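Figure 5(b) reports BIC as a function of the number of intentions. A minimal sketch of how such a model-selection curve could be computed follows. The parameter count in `n_free_params` (initial distribution, transition matrix, one tabular reward per state-action pair and intention) is an assumption rather than the paper's definition, and the log-likelihoods in `toy_loglik` are invented placeholders, not results from the paper.

```python
import math
from typing import Dict


def n_free_params(k: int, n_states: int, n_actions: int) -> int:
    """Hypothetical count: (k - 1) initial probabilities, k * (k - 1)
    transition probabilities, and k tabular reward functions."""
    return (k - 1) + k * (k - 1) + k * n_states * n_actions


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_k(logliks: Dict[int, float], n_states: int, n_actions: int, n_obs: int) -> int:
    """Return the number of intentions with the lowest BIC."""
    scores = {
        k: bic(ll, n_free_params(k, n_states, n_actions), n_obs)
        for k, ll in logliks.items()
    }
    return min(scores, key=scores.get)


# Invented placeholder log-likelihoods for K = 1, 2, 3 (illustration only).
toy_loglik = {1: -5200.0, 2: -4300.0, 3: -4250.0}
best_k = select_k(toy_loglik, n_states=10, n_actions=4, n_obs=5000)
```

With these placeholder values the small likelihood gain from a third intention does not offset its extra parameters, so the two-intention model is selected.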

Theorems & Definitions (3)

Theorem 4.3
proof
proof
