A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health
Nikhil Behari, Edwin Zhang, Yunfan Zhao, Aparna Taneja, Dheeraj Nagaraj, Milind Tambe
TL;DR
This paper introduces a Decision-Language Model (DLM) that uses large language models to dynamically tune RMAB-based resource allocations in public health via language prompts. By generating reward functions as executable code and refining them through grounded RMAB simulations and LLM reflection, the approach enables continual alignment of policy outcomes with evolving human priorities. In simulations modeled on ARMMAN data, DLM achieves near-Base policy performance across a broad set of prompts and outperforms fixed-reward baselines, illustrating the potential for rapid, community-informed policy adaptation under resource constraints. The work highlights both the practical promise and the need for careful safety, ethics, and real-world validation when applying automated policy design in public health contexts.
Abstract
Restless multi-armed bandits (RMABs) have demonstrated success in optimizing resource allocation for large beneficiary populations in public health settings. Unfortunately, RMAB models lack the flexibility to adapt to evolving public health policy priorities. Concurrently, Large Language Models (LLMs) have emerged as adept automated planners across domains of robotic control and navigation. In this paper, we propose a Decision-Language Model (DLM) for RMABs, enabling dynamic fine-tuning of RMAB policies in public health settings using human-language commands. We propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose reward functions as code for a multi-agent RMAB environment, and (3) iterate on the generated reward functions using feedback from grounded RMAB simulations. We illustrate the application of DLM in collaboration with ARMMAN, an India-based non-profit promoting preventative care for pregnant mothers, which currently relies on RMAB policies to optimally allocate health worker calls to low-resource populations. We conduct a technology demonstration in simulation using the Gemini Pro model, showing that DLM can dynamically shape policy outcomes using only human prompts as input.
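The three steps listed in the abstract (interpret a prompt, propose reward functions as code, iterate using simulation feedback) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the LLM call is replaced by a hard-coded stub (propose_reward_functions), the simulator is a toy two-state RMAB with a myopic one-step policy rather than the policy used in the paper, LLM reflection is replaced by a simple numeric comparison, and all names, features, and transition probabilities (ARMS, older, p_act, p_pass) are hypothetical.

```python
import random

# Hypothetical toy population: "older" is an illustrative feature; p_act / p_pass are
# made-up probabilities of ending in the engaged state (1) with / without a call.
ARMS = [
    {"older": 1, "p_act": 0.6, "p_pass": 0.3},
    {"older": 0, "p_act": 0.8, "p_pass": 0.3},
    {"older": 1, "p_act": 0.6, "p_pass": 0.3},
    {"older": 0, "p_act": 0.8, "p_pass": 0.3},
]

def propose_reward_functions(prompt):
    """Stub standing in for an LLM call (assumption): returns candidate reward code."""
    return [
        "def reward(state, features):\n    return state",
        "def reward(state, features):\n    return state * (1 + 2 * features['older'])",
    ]

def compile_reward(source):
    """Turn reward source code into a callable. exec is acceptable here only
    because the source strings are hard-coded above, not untrusted LLM output."""
    namespace = {}
    exec(source, namespace)
    return namespace["reward"]

def simulate(reward, arms, budget, horizon, seed=0):
    """Toy RMAB rollout with a myopic policy: each step, call the `budget` arms
    with the largest expected one-step reward gain from being called."""
    rng = random.Random(seed)
    pulls = [0] * len(arms)
    engaged_steps = 0
    for _ in range(horizon):
        gains = []
        for i, arm in enumerate(arms):
            lift = arm["p_act"] - arm["p_pass"]
            gains.append((lift * (reward(1, arm) - reward(0, arm)), i))
        chosen = {i for _, i in sorted(gains, reverse=True)[:budget]}
        for i, arm in enumerate(arms):
            p = arm["p_act"] if i in chosen else arm["p_pass"]
            engaged_steps += 1 if rng.random() < p else 0
            if i in chosen:
                pulls[i] += 1
    return pulls, engaged_steps

def dlm_select(prompt, arms, budget=2, horizon=10):
    """Propose candidates, simulate each, and keep the one whose outcomes best
    match the prompt. The comparison below (share of calls going to older arms)
    is a numeric stand-in for LLM reflection and is an assumption."""
    outcomes = []
    for source in propose_reward_functions(prompt):
        pulls, _ = simulate(compile_reward(source), arms, budget, horizon)
        older_pulls = sum(p for p, a in zip(pulls, arms) if a["older"])
        outcomes.append((older_pulls / sum(pulls), source))
    best_share, best_source = max(outcomes)
    return best_source, best_share

best_source, best_share = dlm_select("Prioritize older beneficiaries", ARMS)
print(best_source)
print("share of calls to older arms:", best_share)
```

With these toy numbers, the plain engagement reward sends every call to the younger arms, because they have the larger lift from a call. The candidate that up-weights older beneficiaries redirects the calls, so the selection step keeps it. In a real pipeline the selection would be made by the LLM reasoning over simulated outcome summaries, and any generated code would need sandboxing before execution.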
