A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health

Nikhil Behari; Edwin Zhang; Yunfan Zhao; Aparna Taneja; Dheeraj Nagaraj; Milind Tambe

TL;DR

This paper introduces a Decision-Language Model (DLM) that uses large language models to dynamically tune RMAB-based resource allocations in public health via language prompts. By generating reward functions as executable code and refining them through grounded RMAB simulations and LLM reflection, the approach enables continual alignment of policy outcomes with evolving human priorities. In simulations modeled on ARMMAN data, DLM achieves near-Base policy performance across a broad set of prompts and outperforms fixed-reward baselines, illustrating the potential for rapid, community-informed policy adaptation under resource constraints. The work highlights both the practical promise and the need for careful safety, ethics, and real-world validation when applying automated policy design in public health contexts.

Abstract

Restless multi-armed bandits (RMABs) have demonstrated success in optimizing resource allocation for large beneficiary populations in public health settings. Unfortunately, RMAB models lack the flexibility to adapt to evolving public health policy priorities. Concurrently, Large Language Models (LLMs) have emerged as adept automated planners across domains of robotic control and navigation. In this paper, we propose a Decision-Language Model (DLM) for RMABs, enabling dynamic fine-tuning of RMAB policies in public health settings using human-language commands. We propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose reward functions as code for a multi-agent RMAB environment, and (3) iterate on the generated reward functions using feedback from grounded RMAB simulations. We illustrate the application of DLM in collaboration with ARMMAN, an India-based non-profit promoting preventative care for pregnant mothers, which currently relies on RMAB policies to optimally allocate health worker calls to low-resource populations. We conduct a technology demonstration in simulation using the Gemini Pro model, showing that DLM can dynamically shape policy outcomes using only human prompts as input.

Paper Structure (40 sections, 3 theorems, 6 equations, 4 figures, 6 tables, 4 algorithms)


Introduction
Related Work
Background
Decision-Language Model for RMABs
Problem Setting: Reward Generation for RMABs
Provided DLM Context
Multi-Agent Simulation
Reflection Stage
LLM Reward Generation Capability
Experimental Evaluation
Simulated Public Health Setting
Tasks, Baselines, and Metrics
Training Details
Evaluation of DLM Performance
LLM-Generated Rewards and Reflection in Public Health Settings
...and 25 more sections

Key Result

Proposition 1

Assume monotonicity for $V(\cdot,w^{*})$ and let $\hat{w} := \arg \max_{w \in S^{|\mathsf{Supp}(w^{*})|}}V(w,w^{*})$. There exists a transformer $T$ with constant depth $D$ and width $O(\|w^{*}\|_0K)$ which can find $\hat{w}$ with $O(\|w^{*}\|_0 K)$ samples, with oracle access to $V(\cdot,w^{*})$.
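The sample count $O(\|w^{*}\|_0 K)$ is what one would get from searching each supported coordinate over $K$ grid levels in turn. The sketch below illustrates that counting argument with a plain coordinate-wise grid search against an oracle. It is an interpretation, not the construction in the paper: it assumes a grid of $K$ levels per coordinate and a toy separable objective, the monotonicity condition of Definition 1 is not reproduced here, and `coordinate_grid_search` and `toy_oracle` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def coordinate_grid_search(
    oracle: Callable[[Sequence[float]], float],
    dim: int,
    levels: Sequence[float],
) -> Tuple[List[float], int]:
    """Search one coordinate at a time over `levels`, using dim * len(levels) oracle calls."""
    w = [levels[0]] * dim
    queries = 0
    for i in range(dim):
        best_val, best_level = None, w[i]
        for level in levels:
            trial = w[:i] + [level] + w[i + 1:]
            val = oracle(trial)
            queries += 1
            if best_val is None or val > best_val:
                best_val, best_level = val, level
        w[i] = best_level  # fix this coordinate before moving to the next
    return w, queries


# Toy oracle (an assumption): a separable objective maximized at `target`.
target = [0.5, 1.0, 0.0]


def toy_oracle(w: Sequence[float]) -> float:
    return -sum((a - b) ** 2 for a, b in zip(w, target))


grid = [0.0, 0.25, 0.5, 0.75, 1.0]  # K = 5 levels per coordinate
w_hat, n_queries = coordinate_grid_search(toy_oracle, dim=3, levels=grid)
```

With 3 coordinates and 5 levels the search issues 15 oracle queries, matching the dimension-times-$K$ scaling; an exhaustive search over the full grid would instead need $5^3 = 125$.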

Figures (4)

Figure 1: Overview of the DLM language-conditioned reward design loop. We provide three context descriptions to the LLM: a language command (full list of commands in Table \ref{tab:tasks_list}), a list of per-arm demographic features available for proposed reward functions, and syntax cues enabling LLM reward function output directly in code. From this context, 1) the LLM proposes 2) candidate reward functions, which are used to train 3) optimal policies under the proposed rewards. Trained policies are simulated to generate 4) policy outcome comparisons showing state-feature distributions over key demographic groups. Finally, we query an LLM to perform 5) self-reflection [shinn2024reflexion, ma2023eureka] by choosing the candidate reward that best aligns with the original language command; selected candidates are used as context to guide future reward generation.
Figure 2: Main results. We compute normalized reward (Section \ref{subsec:tasks_baselines_metrics}) for each method over 200 seeds, and report the interquartile mean (IQM) and standard error of the IQM across all runs [agarwal2021deep]. We compare the topline Base reward policy to the performance of DLM with No Reflection and with Reflection. We also compare to a No Action policy, a Random policy, and a Default policy that demonstrates how the original (fixed) reward function would perform for each new task. Our method is able to achieve near-Base reward performance across tasks, and consistently outperform the fixed Default reward policy in a completely automated fashion. For some tasks, DLM with Reflection is also able to significantly improve upon the zero-shot proposed reward.
Figure 3: Examples of DLM-generated reward functions vs. ground truth Base reward. Rewards are reformatted for clarity; s represents the binary state, numbers are scalar multipliers, and named features, each a binary quantity, are shown. In some cases (e.g., Older Bias) DLM may identify relevant features zero-shot and use reflection to refine weights. Alternatively, reflection may help refine features (e.g., Age Distribution Tail Emphasis). However, when prompts are ambiguous (e.g., Technically Challenged), reflection may not have sufficient signal to iterate effectively; in these cases, additional human feedback may be required.
Figure 4: Full list of features with corresponding feature indices used in Base and LLM-proposed reward functions. See Table \ref{tab:tasks_list} for a full list of language prompts and ground truth Base reward functions.
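The Figure 3 caption describes rewards built from a binary state s, scalar multipliers, and named binary features. A minimal sketch of one reward of that shape follows. The feature name `age_over_30`, the weights, and the function name `example_reward` are hypothetical and chosen only to illustrate the form; they are not taken from the paper's tables.

```python
from typing import Mapping


def example_reward(s: int, features: Mapping[str, int]) -> float:
    """Illustrative per-arm reward: binary state scaled by a binary feature.

    Hypothetical form: a baseline reward of 1 in the good state (s = 1),
    tripled for arms where the assumed binary feature is set.
    """
    return s * (1.0 + 2.0 * features["age_over_30"])


r_prioritized = example_reward(1, {"age_over_30": 1})
r_baseline = example_reward(1, {"age_over_30": 0})
r_bad_state = example_reward(0, {"age_over_30": 1})
```

Because s multiplies the whole expression, an arm in the s = 0 state earns no reward regardless of its features, so the weights only change which arms are prioritized for reaching the s = 1 state.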

Theorems & Definitions (7)

Definition 1: Monotonicity
Proposition 1
Proposition 2
proof
Proposition 3
proof
Proof of Proposition \ref{prop:transformer-gridsearch-bound}
