Rationality Measurement and Theory for Reinforcement Learning Agents

Kejiang Qian, Amos Storkey, Fengxiang He

TL;DR

This work introduces a formal rationality framework for reinforcement learning by defining perfectly rational actions and two deployment-focused measures: expected rational value risk and the rational risk gap. It decomposes the gap into extrinsic and intrinsic components, upper-bounded respectively by $L_s H\cdot W_1(p_0^{\dagger},p_0)+H^2L_s(L_p+1)\cdot W_1(p^{\dagger},p)$ and $L_\Pi H\sqrt{\log|\mathcal{A}|}+2\sum_{h=1}^H\hat{\mathfrak{R}}_h(\mathcal{Q}_\Pi)+3H^2\sqrt{\frac{\log(H/\delta)}{2T}}$. The authors provide proofs, corollaries, and asymptotic results, and validate hypotheses that regularisation and domain randomisation improve rationality while environment shifts harm it. An empirical study with DQN on Taxi-v3 and Cliff Walking supports the theory, offering practical guidance for improving deployment-time rationality in RL systems. The work also connects to sim-to-real transfer challenges and lays groundwork for future robustness-focused rationality analyses.
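To get a feel for how the two upper bounds quoted above scale, the following minimal sketch evaluates them numerically. It transcribes the two expressions term by term; all numeric inputs are hypothetical, and reading $H$ as the horizon, $T$ as the number of training samples, and $\delta$ as the failure probability is an assumption, since this summary does not define the symbols.

```python
import math


def extrinsic_bound(L_s, L_p, H, w1_init, w1_kernel):
    """L_s*H*W1(p0_dagger, p0) + H^2*L_s*(L_p + 1)*W1(p_dagger, p).

    w1_init and w1_kernel are the two 1-Wasserstein distances, supplied
    by the caller (hypothetical inputs in this sketch).
    """
    return L_s * H * w1_init + H**2 * L_s * (L_p + 1) * w1_kernel


def intrinsic_bound(L_pi, H, num_actions, rademacher_per_step, delta, T):
    """L_Pi*H*sqrt(log|A|) + 2*sum_h R_h + 3*H^2*sqrt(log(H/delta)/(2T)).

    rademacher_per_step holds one empirical Rademacher complexity per
    step h = 1..H (assumed to be given).
    """
    if len(rademacher_per_step) != H:
        raise ValueError("need exactly one Rademacher term per step")
    policy_term = L_pi * H * math.sqrt(math.log(num_actions))
    complexity_term = 2 * sum(rademacher_per_step)
    confidence_term = 3 * H**2 * math.sqrt(math.log(H / delta) / (2 * T))
    return policy_term + complexity_term + confidence_term


if __name__ == "__main__":
    # Hypothetical constants, chosen only to illustrate the scaling.
    print(extrinsic_bound(L_s=1.0, L_p=1.0, H=10, w1_init=0.1, w1_kernel=0.01))
    for T in (100, 10_000):
        print(T, intrinsic_bound(0.5, 10, 6, [0.02] * 10, delta=0.05, T=T))
```

The extrinsic expression vanishes when both Wasserstein distances are zero (no environment shift), and the kernel term grows quadratically in $H$, so small shifts in the transition kernel dominate at long horizons. In the intrinsic expression only the last term shrinks as $T$ grows.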

Abstract

This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy between a policy's actions and their rational counterparts, accumulated over the trajectory in deployment, is defined as the expected rational risk; an empirical average version in training is also defined. Their difference, termed the rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic component due to the algorithm's generalisability in a dynamic environment. These components are upper bounded by, respectively, (1) the $1$-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits of regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.
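As a concrete illustration of the extrinsic quantity, the sketch below computes the $1$-Wasserstein distance between two discrete initial state distributions. It assumes, purely for illustration, that states sit at sorted positions on a line, where $W_1$ reduces to the integral of the absolute difference between the two cumulative distribution functions. The paper's actual state metric is not given in this summary, and the function name `w1_on_line` is hypothetical.

```python
def w1_on_line(p, q, positions=None):
    """1-Wasserstein distance between two discrete distributions on a line.

    p, q      : probability vectors over the same states.
    positions : ascending coordinates of the states (assumption: states
                are embedded on a line; defaults to 0, 1, 2, ...).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if positions is None:
        positions = list(range(len(p)))
    total = 0.0
    cdf_gap = 0.0  # running value of F_p(x) - F_q(x)
    for i in range(len(p) - 1):
        cdf_gap += p[i] - q[i]
        # Mass imbalance to the left of the gap must cross it.
        total += abs(cdf_gap) * (positions[i + 1] - positions[i])
    return total


if __name__ == "__main__":
    train_p0 = [1.0, 0.0, 0.0]   # hypothetical training start distribution
    deploy_p0 = [0.0, 0.0, 1.0]  # hypothetical deployment start distribution
    print(w1_on_line(train_p0, deploy_p0))
```

Identical distributions give a distance of zero, which is the no-shift case in which the extrinsic component of the bound disappears.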

Paper Structure (48 sections, 17 theorems, 90 equations, 5 figures, 5 tables, 1 algorithm)

Key Result

Lemma 1

If an action is perfectly rational, its rational value loss is zero.

Figures (5)

  • Figure 1: Reward curves of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments.
  • Figure 2: Rational risk gap of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments.
  • Figure 3: Rational risk gap of DQN across different environment levels in Taxi-v3 and Cliff Walking environments. We evaluate DQN under increasing challenge levels of the training environments (0%, 10%, 30%, 50%, 70%), where each level is the probability of action randomisation during training.
  • Figure 4: Reward curves of DQN across different environment levels in Taxi-v3 and Cliff Walking environments. We evaluate DQN under increasing challenge levels of the training environments (0%, 10%, 30%, 50%, 70%), where each level is the probability of action randomisation during training.
  • Figure 5: Empirical rational value risk of DQN across different environment levels in Taxi-v3 and Cliff Walking environments. We evaluate DQN under increasing challenge levels of the training environments (0%, 10%, 30%, 50%, 70%), where each level is the probability of action randomisation during training.

Theorems & Definitions (39)

  • Definition 1: $p$-Wasserstein distance
  • Definition 2: Total Variation (TV) distance
  • Definition 3: Kullback-Leibler (KL) divergence
  • Definition 4: empirical Rademacher complexity
  • Definition 5: perfectly rational action
  • Remark 1
  • Definition 6: rational value loss
  • Lemma 1
  • Definition 7: expected rational value loss
  • Definition 8: empirical rational value loss
  • ...and 29 more
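
For readers unfamiliar with Definition 4, the sketch below computes the empirical Rademacher complexity in its standard textbook form, $\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\sigma_i f(x_i)\big]$, for a small finite function class by enumerating every sign vector exactly. The paper's own definition may differ in normalisation, and the finite class used here is a hypothetical stand-in for the value function class.

```python
from itertools import product


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values : one row per function f, holding f(x_1), ..., f(x_n)
                      on a fixed sample. Enumerates all 2^n sign vectors,
                      so it is only practical for small n.
    """
    n = len(function_values[0])
    if any(len(row) != n for row in function_values):
        raise ValueError("every function must be evaluated on the same sample")
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        # Supremum over the class of the sign-weighted sample average.
        total += max(
            sum(s * v for s, v in zip(signs, row)) / n
            for row in function_values
        )
    return total / 2**n


if __name__ == "__main__":
    single = [[1.0, 1.0]]                 # one function: cannot fit noise
    pair = [[1.0, 1.0], [-1.0, -1.0]]     # f and -f: can match the sign sum
    print(empirical_rademacher(single), empirical_rademacher(pair))
```

A class containing a single function has complexity zero because it cannot correlate with random signs, while adding its negation raises the value. This is the sense in which a richer value function class enlarges the intrinsic term.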