Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback

Chenyang Zhao, Vinny Cahill, Ivana Dusparic

TL;DR

Multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering, offering a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.

Abstract

Reward design has been one of the central challenges for real-world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving open the question of how RLAIF handles systems that involve multiple objectives. In such systems, trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Our RLAIF framework for multi-objective tasks: the policy interacts with the environment and transitions enter the replay buffer $\mathcal{B}$; segment pairs are annotated by an LLM and stored in a preference buffer $\mathcal{D}$; the reward model $r_\psi$ is updated from $\mathcal{D}$ and re-scores $\mathcal{B}$; and the policy is trained with RL algorithms on the updated replay buffer. During annotation, we make the optimization criteria explicit, stating the task objectives and desired trade-offs, so the learned reward reflects user expectations and aligns behaviour across multiple objectives without bespoke reward engineering.
  • Figure 2: An illustrated workflow of the preference annotation process. The annotator is given the objectives under consideration, the overall goal, and descriptions of the sampled pair of segments $\{\sigma^1, \sigma^2\}$, and outputs a preference label $y\in\{0,1,2\}$.
  • Figure 3: An example text description of a traffic segment of length 1 for the task of balancing traffic throughput and environmental impact. The text description includes the observations, explained with their semantic meanings, as well as information on throughput and carbon emissions.
  • Figure 4: Left and middle: average traffic throughput and CO2 emissions in the throughput-emission scenario over the course of learning. RLAIF achieves better performance than the single-objective baselines and falls short only of the Linear baseline, which requires reward engineering. Right: comparison of 5 policies in the lane-priorities scenario, each trained with instructions that specify different lane priorities.
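
The reward-model update described in Figure 1 and the label $y\in\{0,1,2\}$ from Figure 2 can be illustrated with a minimal sketch. The captions state neither the loss nor the label encoding, so two assumptions are made here: a Bradley-Terry style cross-entropy loss, which is common in preference-based RL, and the encoding 0 = first segment preferred, 1 = second segment preferred, 2 = no preference. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def preference_probability(rewards_1, rewards_2):
    """Probability that segment 1 is preferred over segment 2.

    Assumes a Bradley-Terry model on the summed predicted rewards of each
    segment (an assumption; the source does not state the model).
    """
    diff = sum(rewards_1) - sum(rewards_2)
    # Numerically stable logistic sigmoid of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def preference_loss(rewards_1, rewards_2, label):
    """Cross-entropy between the predicted preference and the LLM label.

    Assumed label encoding (not given in the source):
      0 -> segment 1 preferred, 1 -> segment 2 preferred, 2 -> no preference.
    """
    targets = {0: 1.0, 1: 0.0, 2: 0.5}
    if label not in targets:
        raise ValueError(f"label must be 0, 1, or 2, got {label!r}")
    target = targets[label]
    p1 = preference_probability(rewards_1, rewards_2)
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(p1 + eps)
             + (1.0 - target) * math.log(1.0 - p1 + eps))


if __name__ == "__main__":
    # Hypothetical per-step reward predictions for two segments.
    seg_1 = [0.4, 0.6, 0.5]
    seg_2 = [0.1, 0.2, 0.3]
    for y in (0, 1, 2):
        print(y, round(preference_loss(seg_1, seg_2, y), 4))
```

Under these assumptions, the loss is smallest when the label agrees with the segment that has the higher predicted return, so minimizing it over the preference buffer $\mathcal{D}$ pushes $r_\psi$ toward the annotator's stated trade-offs.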