Rule-Bottleneck Reinforcement Learning: Joint Explanation and Decision Optimization for Resource Allocation with Language Agents

Mauricio Tec, Guojun Xiong, Haichuan Wang, Francesca Dominici, Milind Tambe

TL;DR

Rule-Bottleneck Reinforcement Learning (RBRL) addresses the interpretability gap in resource-allocation RL by co-optimizing decisions and explanations. It uses an LLM to generate a diverse set of candidate rules at each step, an attention-based policy to select among them, and chain-of-thought style outputs to justify environment actions; a rule-based reward derived from LLM judgments guides learning via Soft Actor-Critic on an augmented state that includes the rule set. The approach yields competitive decision performance compared with deep RL baselines and outperforms LLM fine-tuning in online settings, while substantially improving explanation quality and reducing hallucinations. Experiments across HeatAlerts and WearableDeviceAssignment domains, together with human surveys, demonstrate improved trust and interpretability without prohibitive computational costs. These results suggest a practical path toward deploying AI-assisted decision-support systems in high-stakes, budget-constrained environments where transparency and accountability are critical.

Abstract

Deep Reinforcement Learning (RL) is remarkably effective in addressing sequential resource allocation problems in domains such as healthcare, public policy, and resource management. However, deep RL policies often lack transparency and adaptability, challenging their deployment alongside human decision-makers. In contrast, Language Agents, powered by large language models (LLMs), provide human-understandable reasoning but may struggle with effective decision making. To bridge this gap, we propose Rule-Bottleneck Reinforcement Learning (RBRL), a novel framework that jointly optimizes decisions and explanations. At each step, RBRL generates candidate rules with an LLM, selects among them using an attention-based RL policy, and determines the environment action with an explanation via chain-of-thought reasoning. The RL rule selection is optimized using the environment rewards and an explainability metric judged by the LLM. Evaluations in real-world scenarios highlight RBRL's competitive performance with deep RL and efficiency gains over LLM fine-tuning. A survey further confirms the enhanced quality of its explanations.
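
The rule-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_rule`, the projection matrices `W_state` and `W_rule`, the vector sizes, and the random stand-ins for the state encoding and rule embeddings are all assumptions made for illustration. In RBRL the policy parameters would be trained with Soft Actor-Critic and the rule embeddings would come from a text embedding model; here the weights are random and untrained.

```python
import numpy as np

def select_rule(state_vec, rule_embs, W_state, W_rule, rng):
    """Sample one candidate rule from a softmax over attention scores.

    All names and shapes are hypothetical; the weights are untrained.
    """
    s = W_state @ state_vec            # projected state, shape (d,)
    r = rule_embs @ W_rule.T           # projected rules, shape (n_rules, d)
    scores = r @ s / np.sqrt(s.shape[0])  # scaled dot-product scores
    scores = scores - scores.max()     # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    idx = int(rng.choice(len(probs), p=probs))
    return idx, probs

rng = np.random.default_rng(0)
state_vec = rng.normal(size=4)         # stand-in for the encoded state
rule_embs = rng.normal(size=(3, 8))    # stand-in for 3 embedded rules
W_state = rng.normal(size=(16, 4))     # hypothetical state projection
W_rule = rng.normal(size=(16, 8))      # hypothetical rule projection

idx, probs = select_rule(state_vec, rule_embs, W_state, W_rule, rng)
print("selected rule:", idx, "probabilities:", probs)
```

Sampling from the distribution, rather than always taking the highest-scoring rule, keeps the selection stochastic, which is the usual setting for an entropy-regularized method such as SAC.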

Paper Structure

This paper contains 31 sections, 1 theorem, 11 equations, 15 figures, 4 tables, 3 algorithms.

Key Result

Theorem 4.1

The state transition of the RBRL MDP can be calculated as where $P( {\mathcal{R}}_{t+1}|{\mathbf{s}}_{t+1})= \pi_\textrm{LLM}({\mathcal{R}}_{t+1} | {\mathbf{p}}_t, \pmb{\tau}_t)$ is the probability of the LLM generating rule set ${\mathcal{R}}_{t+1}$ provided the state ${\mathbf{s}}_{t+1}$, $P({\mathbf{s}}_{t+1}|{\mathbf{a}}^{\text{env}},{\mathbf{s}}_t)$ i

Figures (15)

  • Figure 1: Overview of the RBRL framework for joint sequential decision-making and explanation generation at time step $t$. Starting with the current state ${\mathbf{s}}_t$, a state-to-language descriptor generates $\text{lang}({\mathbf{s}}_t)$, which is used to create the input prompt ${\mathbf{p}}_t$. The LLM processes ${\mathbf{p}}_t$ to produce a thought $\pmb{\tau}_t$ and a set of candidate rules ${\mathcal{R}}_t$. An attention-based policy network selects a rule ${\mathbf{a}}^{\text{rule}}_t$, which is used to derive an executable action ${\mathbf{a}}^{\text{env}}_t$ satisfying the budget constraint $B({\mathbf{s}}_t)$ for the environment and a human-readable explanation $\pmb{\ell}_t^{\text{expl}}$, while also providing a rule reward $r_t^{\text{rule}}$. The environment transitions to the next state ${\mathbf{s}}_{t+1}$, returning an environment reward $r_t^{\text{env}}$. This process is repeated at subsequent time steps.
  • Figure 2: Examples of task prompts and generated rules.
  • Figure 3: Overview of the Rule Selection step. The current state is encoded as a key vector, while candidate rules are encoded as queries using a text embedding API (e.g., a BERT sentence embedding). An attention-based policy network $\pi_\theta$ trained with SAC computes a probability distribution over the candidate rules, enabling the selection of the most suitable rule for decision-making and explanation.
  • Figure 4: Results from Q1. Main comparison of RBRL on three resource allocation problems. The plots show the mean and standard error across three seeds, using exponentially weighted moving averages with a half-life of 100.
  • Figure 5: Additional experiments and ablations. (a) Comparison of RBRL with thoughts-based RL (TBRL) and the baseline rule-based LLM without RL training; (b) comparison against LLM fine-tuning with PPO at the token level from the environment reward with CoT generation for the Mimic environment; (c) effect of removing the rule reward in the HeatAlerts environments. For (a) and (c), we show the distribution of rewards in the last 20% of training steps.
  • ...and 10 more figures
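
The smoothing mentioned in the Figure 4 caption (exponentially weighted moving averages with a half-life of 100) can be sketched as follows. The caption does not say which variant was used (for example, whether a bias correction was applied), so this assumes the plain recursive form; the function name `ewma` is hypothetical. Under this assumption the smoothing factor is $\alpha = 1 - 0.5^{1/h}$ for half-life $h$, so the weight of an observation halves every $h$ steps.

```python
def ewma(values, half_life=100.0):
    """Recursive exponentially weighted moving average (no bias correction).

    Assumed form: y_0 = x_0, y_t = alpha * x_t + (1 - alpha) * y_{t-1},
    with alpha chosen so that (1 - alpha) ** half_life == 0.5.
    """
    alpha = 1.0 - 0.5 ** (1.0 / half_life)
    smoothed = []
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1.0 - alpha) * avg
        smoothed.append(avg)
    return smoothed

# A single unit reward followed by 100 zeros decays to half its weight.
curve = ewma([1.0] + [0.0] * 100, half_life=100.0)
print(curve[-1])
```

With a half-life of 100, each plotted point averages over roughly the last few hundred steps, which is why the curves look smooth even when per-step rewards are noisy.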

Theorems & Definitions (2)

  • Remark 3.1
  • Theorem 4.1