Rule-Bottleneck Reinforcement Learning: Joint Explanation and Decision Optimization for Resource Allocation with Language Agents
Mauricio Tec, Guojun Xiong, Haichuan Wang, Francesca Dominici, Milind Tambe
TL;DR
Rule-Bottleneck Reinforcement Learning (RBRL) addresses the interpretability gap in resource-allocation RL by co-optimizing decisions and explanations. It uses an LLM to generate a diverse set of candidate rules at each step, an attention-based policy to select among them, and chain-of-thought-style outputs to justify environment actions; a rule-based reward derived from LLM judgments guides learning via Soft Actor-Critic on an augmented state that includes the rule set. The approach yields decision performance competitive with deep RL baselines and outperforms LLM fine-tuning in online settings, while substantially improving explanation quality and reducing hallucinations. Experiments across the HeatAlerts and WearableDeviceAssignment domains, together with human surveys, demonstrate improved trust and interpretability without prohibitive computational costs. These results suggest a practical path toward deploying AI-assisted decision-support systems in high-stakes, budget-constrained environments where transparency and accountability are critical.
Abstract
Deep Reinforcement Learning (RL) is remarkably effective in addressing sequential resource allocation problems in domains such as healthcare, public policy, and resource management. However, deep RL policies often lack transparency and adaptability, which complicates their deployment alongside human decision-makers. In contrast, Language Agents, powered by large language models (LLMs), provide human-understandable reasoning but may struggle with effective decision making. To bridge this gap, we propose Rule-Bottleneck Reinforcement Learning (RBRL), a novel framework that jointly optimizes decisions and explanations. At each step, RBRL generates candidate rules with an LLM, selects among them using an attention-based RL policy, and determines the environment action with an explanation via chain-of-thought reasoning. The RL rule selection is optimized using the environment rewards and an explainability metric judged by the LLM. Evaluations in real-world scenarios highlight RBRL's competitive performance with deep RL and efficiency gains over LLM fine-tuning. A survey further confirms the enhanced quality of its explanations.
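
The rule-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are random placeholders standing in for encoded states and LLM-generated rules, the projection matrices `W_q` and `W_k` are hypothetical, and the additive mix of environment reward and LLM-judged rule reward (with a `weight` parameter) is an assumption, since the abstract does not state how the two signals are combined. Training with Soft Actor-Critic is omitted; only the forward pass of the selection policy is shown.

```python
import numpy as np


def select_rule(state_emb, rule_embs, W_q, W_k, rng):
    """Sample one candidate rule using scaled dot-product attention scores.

    state_emb: (d_state,) placeholder embedding of the current state.
    rule_embs: (n_rules, d_rule) placeholder embeddings of the candidate rules.
    W_q, W_k:  hypothetical projections into a shared key/query space.
    """
    query = state_emb @ W_q                          # shape (d_k,)
    keys = rule_embs @ W_k                           # shape (n_rules, d_k)
    logits = keys @ query / np.sqrt(keys.shape[1])   # one score per rule
    logits = logits - logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()    # softmax over rules
    idx = int(rng.choice(len(probs), p=probs))       # stochastic selection
    return idx, probs


def combined_reward(env_reward, rule_reward, weight=1.0):
    """Assumed additive mix of environment reward and LLM-judged rule reward."""
    return env_reward + weight * rule_reward


# Toy dimensions and random inputs, chosen only for illustration.
rng = np.random.default_rng(0)
d_state, d_rule, d_k, n_rules = 8, 6, 4, 5
state_emb = rng.normal(size=d_state)
rule_embs = rng.normal(size=(n_rules, d_rule))
W_q = rng.normal(size=(d_state, d_k))
W_k = rng.normal(size=(d_rule, d_k))

idx, probs = select_rule(state_emb, rule_embs, W_q, W_k, rng)
reward = combined_reward(env_reward=1.0, rule_reward=0.5, weight=2.0)
```

In a full pipeline, the selected rule would be passed back to the LLM to produce the environment action and its explanation, and the policy parameters would be updated from the resulting reward signal.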
