Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

Shresth Verma; Niclas Boehmer; Lingkai Kong; Milind Tambe

TL;DR

This work tackles the problem of designing reward functions for Restless Multi-Armed Bandits (RMABs) that reflect multi-objective human preferences. It introduces the Social Choice Language Model (SCLM), a two-stage framework: an LLM-based generator creates a pool of candidate reward functions, and an adjudicator selects one using social welfare aggregation over per-clause alignment scores. Alignment is computed via simulation-based (SCLM-SIM) and LLM-based (SCLM-LLM) scorers, with safeguards for unintended utility shifts and potential utility drops, and theoretical guarantees on selection quality under scoring noise. Empirical results on synthetic and real-world RMAB problems demonstrate that SCLM reliably yields reward functions that are more aligned, balanced, and less prone to bias than purely LLM-driven approaches, while enabling explicit control over the trade-offs via welfare function choices. The framework thus provides a transparent, human-centric method for multi-objective reward design in multiagent planning with RMABs, with potential applicability to other domains requiring balanced preference integration.
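The adjudicator's selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`utilitarian`, `egalitarian`, `nash`, `adjudicate`), the candidate names, and the score values are all hypothetical, and the three welfare functions are standard social choice examples assumed here for illustration. The alignment scores would in practice come from a scorer such as SCLM-SIM or SCLM-LLM.

```python
import math
from typing import Callable, Dict, Sequence

# A welfare function maps one candidate's per-clause alignment scores to a single number.
Welfare = Callable[[Sequence[float]], float]


def utilitarian(scores: Sequence[float]) -> float:
    """Sum of per-clause scores (favours total alignment)."""
    return sum(scores)


def egalitarian(scores: Sequence[float]) -> float:
    """Minimum per-clause score (favours the worst-served clause)."""
    return min(scores)


def nash(scores: Sequence[float]) -> float:
    """Product of per-clause scores (a compromise between the two above)."""
    return math.prod(scores)


def adjudicate(alignment: Dict[str, Sequence[float]], welfare: Welfare) -> str:
    """Return the name of the candidate reward function with the highest welfare."""
    return max(alignment, key=lambda name: welfare(alignment[name]))


# Hypothetical alignment scores: one list per candidate, one entry per preference clause.
alignment = {
    "reward_a": [1.0, 0.2],   # strong on clause 1, weak on clause 2
    "reward_b": [0.5, 0.5],   # perfectly balanced
    "reward_c": [0.8, 0.35],  # in between
}

for welfare in (utilitarian, egalitarian, nash):
    print(welfare.__name__, "->", adjudicate(alignment, welfare))
```

With these made-up scores, each welfare function selects a different candidate, which is the sense in which the welfare function choice gives explicit control over the trade-off between clauses.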

Abstract

LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.

Paper Structure (47 sections, 2 theorems, 28 equations, 16 figures, 8 tables)

Related Works
LLM-enhanced RL
Multi-Objective Reinforcement Learning (MORL)
Inverse Reinforcement Learning (IRL)
Preliminaries
Problem Statement & Challenges
Social Choice Language Model (SCLM)
Generator
Adjudicator
Selection via Social Welfare Function
Computing Alignment Scores
Simulator Scorer Model (SCLM-SIM)
LLM Scorer Model (SCLM-LLM)
Preventing Unintended Utility Shifts and Utility Drop
Error Bounds for Adjudicator's Selection
...and 32 more sections

Key Result

Proposition 1

The relative regret is bounded by $1-\alpha^2$.
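This listing does not reproduce the definition of $\alpha$ or the proposition's hypotheses. As a hedged illustration of how a bound of this shape can arise, consider the following sketch. Its assumptions are ours and may differ from the paper's: each estimated alignment score $\hat{s}_i(R)$ is within a multiplicative factor $\alpha \in (0,1]$ of the true nonnegative score $s_i(R)$, and the welfare function $W$ is monotone and positively homogeneous of degree one (as the sum, the minimum, and the geometric mean are).

$$
\alpha\, s_i(R) \le \hat{s}_i(R) \le \tfrac{1}{\alpha}\, s_i(R)
\quad\Longrightarrow\quad
\alpha\, W(s(R)) \le W(\hat{s}(R)) \le \tfrac{1}{\alpha}\, W(s(R)).
$$

Let $\hat{R} = \arg\max_{R \in \mathcal{R}} W(\hat{s}(R))$ be the selected function and $R^{*} = \arg\max_{R \in \mathcal{R}} W(s(R))$ the truly best one. Then

$$
W(s(\hat{R})) \;\ge\; \alpha\, W(\hat{s}(\hat{R})) \;\ge\; \alpha\, W(\hat{s}(R^{*})) \;\ge\; \alpha^{2}\, W(s(R^{*})),
$$

so, provided $W(s(R^{*})) > 0$,

$$
\frac{W(s(R^{*})) - W(s(\hat{R}))}{W(s(R^{*}))} \;\le\; 1 - \alpha^{2}.
$$

Under these assumptions, a scorer with $\alpha = 0.9$ would give a relative regret of at most $1 - 0.81 = 0.19$.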

Figures (16)

Figure 1: Tradeoffs between prioritization clauses.
Figure 2: Utility feature distributions for the default reward function (orange) and the reward function returned for the prompt "Prioritize agents with low income" (blue) by the DLM baseline. The $x$-axis depicts the feature value and the $y$-axis the total utility generated by agents with this value.
Figure 3: Overview of SCLM. In step 1, the preference prompt is passed to the generator, which performs an evolutionary search to create a pool $\mathcal{R}$ of candidate reward functions. In step 2, these functions are passed to the adjudicator, where a scorer model (e.g., the simulator or LLM scorer) computes the alignment scores. In step 3, a user-defined social welfare function selects a reward function based on the alignment scores.
Figure 4: Results comparing the quality of reward design methods for composite prioritization prompts. Results are averaged across $180=12\cdot 15$ values: $12$ composite prompts on $15$ RMAB instances (from $3$ datasets). Error bars represent the standard error.
Figure 5: Utility feature distributions for the default reward function (orange) and reward function returned for prompt "Prioritize agents with low income" (blue). The $x$-axis depicts the values of the feature and the $y$-axis the total utility generated by all agents with this feature value.
...and 11 more figures

Theorems & Definitions (5)

Example 1
Proposition 1
Lemma 2
proof
proof
