Reward Design for Justifiable Sequential Decision-Making

Aleksa Sukovic; Goran Radanovic

TL;DR

This work addresses the challenge of designing RL rewards that yield decisions which are easy to justify with concise evidence. It proposes a debate-based reward in which two argumentative agents present evidence for competing actions, and a judge proxy derives a justifiability signal $r^d$ from that evidence; this signal is combined with the environment reward via $\mathcal{R}_J = \sum_t \gamma^t\big[(1-\lambda) r^e(s_t,a_t) + \lambda r^d(s_t,a_t,a^B_t)\big]$. A learned judge $\mathcal{J}_\theta$ is trained from a preference dataset using a Bradley-Terry formulation, and contextual argumentative policies are learned to solve debate games. The approach is evaluated on sepsis treatment tasks using MIMIC-III data, demonstrating that debate-based feedback improves the justifiability of learned policies even though the judge sees only limited evidence rather than the full state, while maintaining competitive task performance. Multi-agent debate yields robust, refutation-resistant evidence, and debate-based explanations outperform SHAP in aligning with human preferences. Overall, the work shows that debate-based reward modeling can produce decisions that are more easily corroborated, enabling effective human-in-the-loop validation in high-stakes domains like healthcare.
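The mixed objective and the Bradley-Terry preference model mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-trajectory reward lists, and the assumed range $\lambda \in [0, 1]$ are illustrative assumptions, and the judge is represented only by two scalar scores rather than a trained model $\mathcal{J}_\theta$.

```python
import math
from typing import Sequence


def justifiable_return(
    env_rewards: Sequence[float],
    debate_rewards: Sequence[float],
    lam: float,
    gamma: float,
) -> float:
    """Discounted sum of (1 - lam) * r^e_t + lam * r^d_t over one trajectory."""
    if len(env_rewards) != len(debate_rewards):
        raise ValueError("reward sequences must have equal length")
    # Assumption: the debate coefficient is a convex mixing weight.
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    total = 0.0
    for t, (r_e, r_d) in enumerate(zip(env_rewards, debate_rewards)):
        total += (gamma ** t) * ((1.0 - lam) * r_e + lam * r_d)
    return total


def bradley_terry_nll(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under Bradley-Terry.

    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected),
    so the loss is log(1 + exp(-(score_preferred - score_rejected))).
    """
    diff = score_preferred - score_rejected
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

With `lam = 0.0` the return reduces to the ordinary discounted environment return, which corresponds to the baseline setting $\lambda = 0.0$ referred to in the figure captions.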

Abstract

Equipping agents with the capacity to justify their decisions using supporting evidence represents a cornerstone of accountable decision-making. Furthermore, ensuring that justifications are in line with human expectations and societal norms is vital, especially in high-stakes situations such as healthcare. In this work, we propose the use of a debate-based reward model for reinforcement learning agents, where the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. This reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. In the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. Given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. We demonstrate the potential of our approach in learning policies for prescribing and justifying treatment decisions for septic patients. We show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies highly favored by the judge when compared to the policy obtained solely from the environment rewards, while hardly sacrificing any performance. Moreover, in terms of the overall performance and justifiability of trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. This suggests that the debate game outputs the key information contained in states that is most relevant for evaluating decisions, which in turn substantiates the practicality of combining our approach with human-in-the-loop evaluations. Lastly, we showcase that agents trained via multi-agent debate learn to propose evidence that is resilient to refutations and closely aligns with human preferences.
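The turn-taking debate described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: evidence is assumed to be a single state feature revealed per turn, the judge is an arbitrary scoring callable, and the $\pm 1$ reward magnitudes are hypothetical (the source only states that a positive reward is issued when the judge prefers the policy's action over the baseline's).

```python
from typing import Callable, Dict, List, Tuple

Evidence = Tuple[str, float]  # (feature name, value): a hypothetical encoding
Proposer = Callable[[Dict[str, float], List[Evidence]], str]
Judge = Callable[[List[Evidence], int], float]


def run_debate(
    state: Dict[str, float],
    propose_1: Proposer,
    propose_2: Proposer,
    evidence_per_agent: int,
) -> List[Evidence]:
    """Agents alternate turns, each revealing one unseen state feature per turn."""
    shown: List[Evidence] = []
    for _ in range(evidence_per_agent):
        for propose in (propose_1, propose_2):
            feature = propose(state, shown)
            already_shown = any(name == feature for name, _ in shown)
            if feature not in state or already_shown:
                raise ValueError(f"invalid or repeated evidence: {feature}")
            shown.append((feature, state[feature]))
    return shown


def debate_reward(
    judge: Judge,
    evidence: List[Evidence],
    action: int,
    baseline_action: int,
) -> float:
    """Return +1 if the judge scores `action` above `baseline_action`,
    -1 if below, and 0 on a tie (magnitudes are an assumption)."""
    score = judge(evidence, action)
    baseline_score = judge(evidence, baseline_action)
    if score > baseline_score:
        return 1.0
    if score < baseline_score:
        return -1.0
    return 0.0
```

In this sketch the judge only ever sees the revealed evidence, never the full `state`, which mirrors the limited-information setting that the abstract contrasts with an ideal judge proxy that has access to the full state.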

Paper Structure (36 sections, 6 equations, 6 figures, 3 tables)

Introduction
Related Work
Formal Setup
Agents
Reward Modeling via Debate
Learning Framework
Preference Dataset
Judge Model
Argumentative Agent
Method
Experiments
Environmental Setup
Experiment 1: Effectiveness of Task Policies
Experiment 2: Debate-Based Feedback vs. State-Based Feedback
Experiment 3: Effectiveness of Argumentative Policies
...and 21 more sections

Figures (6)

Figure 1: To obtain a debate reward $r_t^d$ in state $s_t$, two argumentative agents $A_1$ and $A_2$ take turns proposing supporting evidence (depicted as triangles) for two decisions, up to a predefined limit (here, $3$ pieces of evidence each). A positive debate reward is then issued whenever a proxy of a judge $\mathcal{J}$ considers action $a_t$, taken by the justifiable policy $\pi^J$, better justified than action $a_t^B$, taken by the baseline policy $\pi^B$. This reward is mixed with the environment reward $r_t^e$ via the debate coefficient $\lambda$, yielding the final reward $r_t$ used to train the justifiable agent.
Figure 2: Evaluation of justifiable policies. For (b)-(d), the confidence intervals represent $\pm 2$ standard errors of the mean over $5$ random seeds. (a) Policy performance as measured by weighted importance sampling (WIS) evaluation on a held-out test set with $\pm 1$ terminal rewards for every patient discharge or death. The mean and standard deviation are reported over $5$ random seeds. (b) Percentage of times the judge preferred the decisions of justifiable policies (i.e., $\lambda > 0.0$) over those of the baseline policy (i.e., $\lambda=0.0$). (c)-(d) Observed patient mortality (y-axis) against variations in IV/VC treatment doses prescribed by clinicians compared to the recommendations of learned policies (x-axis).
Figure 3: Performance of policies trained with state-based feedback compared to debate-based feedback, as measured by the weighted importance sampling evaluation on a held-out test set with $\pm 1$ terminal rewards for every patient discharge or death. The mean and standard deviation are reported over $5$ random seeds.
Figure 4: (a) Fraction of aligned decisions of policies trained with debate- and state-based feedback. Confidence intervals (CI) represent $\pm 2$ standard errors of the mean over $5$ random seeds. (b) Accuracy of the judge in predicting the preferred action, with and without the confuser agent, with CI representing $\pm 2$ standard errors of the mean estimate. (c) Effectiveness of SHAP-based explanations when used to justify a decision, as measured by the judge's accuracy, with CI representing $\pm 2$ standard errors of the mean estimate.
Figure 5: Quantitative and qualitative evaluation of policies trained using debate-based rewards limited to $L=4$ pieces of evidence. For (f)-(h), the confidence intervals represent $\pm 2$ standard errors of the mean over $5$ random seeds. (a)-(d) Performance of policies trained with $L=4$ and $L=6$ pieces of evidence, as measured by WIS evaluation on a held-out test set with $\pm1$ terminal rewards for every patient discharge or death. The mean and standard deviation are reported over $5$ random seeds. (e) Accuracy of the judge in predicting the preferred action using $4$ proposed pieces of evidence, with and without the confuser agent. The CIs represent $\pm 2$ standard errors of the mean estimate. (f) Percentage of times the judge preferred the decisions of justifiable policies (i.e., $\lambda > 0.0$) over those of the baseline policy (i.e., $\lambda = 0.0$). (g)-(h) Observed patient mortality (y-axis) against variations in IV/VC treatment doses prescribed by clinicians compared to the recommendations of learned policies (x-axis).
...and 1 more figures
