Polychromic Objectives for Reinforcement Learning

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh

TL;DR

The paper tackles entropy collapse during RL fine-tuning by introducing set-reinforcement learning and a polychromic objective that jointly rewards performance and diversity across a set of trajectories. It presents Polychromic PPO, a PPO-based algorithm that uses vine sampling to collect on-policy trajectory sets and a modified, set-wide advantage to optimize the polychromic objective. The approach yields higher reward and success rates, greater diversity as measured by pass@$k$ coverage, and improved robustness to initial-state perturbations across BabyAI, Minigrid, and Algorithmic Creativity. These results demonstrate that explicitly balancing exploration and refinement in a set-valued objective can maintain and exploit a diverse repertoire of strategies, with practical implications for RL fine-tuning on complex, long-horizon tasks.

Abstract

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.
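To make the idea of a set-level objective concrete, here is a minimal sketch that scores a whole set of trajectories rather than each one in isolation. The abstract does not give the functional form, so two choices below are illustrative assumptions: the product of mean reward and diversity, and the "fraction of distinct trajectories" diversity measure. The names `diversity`, `set_objective`, and `reward_fn` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Hashable, Sequence


def diversity(trajectories: Sequence[Hashable]) -> float:
    """Fraction of distinct trajectories in the set (an assumed, toy diversity measure)."""
    return len(set(trajectories)) / len(trajectories)


def set_objective(
    trajectories: Sequence[Hashable],
    reward_fn: Callable[[Hashable], float],
) -> float:
    """Score a set of trajectories by mean reward times diversity.

    The product form is an assumption for illustration: it is high only
    when the set is both successful and varied.
    """
    mean_reward = sum(reward_fn(t) for t in trajectories) / len(trajectories)
    return mean_reward * diversity(trajectories)


# Toy example: every trajectory succeeds (reward 1.0), but one set has collapsed.
always_success = lambda trajectory: 1.0
collapsed = [("left", "up")] * 4
varied = [("left", "up"), ("up", "left"), ("right", "up"), ("up", "right")]

collapsed_score = set_objective(collapsed, always_success)
varied_score = set_objective(varied, always_success)
```

Under these assumptions the collapsed set scores lower than the varied set even though every individual trajectory earns the same reward, which is the kind of pressure against collapse that the abstract describes.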

Paper Structure

This paper contains 26 sections, 6 theorems, 42 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Lemma 3.2

Given any two policies $\pi_\theta$ and $\pi_\beta$ and a fixed initial state $s_0$, under any set objective function $f$,

Figures (6)

  • Figure 1: The set value of a state (circled) is the expected discounted return of the subtree (highlighted) rooted in this state.
  • Figure 2: Results on Algorithmic Creativity. Bars show normalized values for each metric, with raw values above each bar.
  • Figure 3: Pass@$k$ on BabyAI tasks. Top: methods without UCB. Bottom: methods with UCB. Columns show Goto, Pickup, Synthseq, and Bosslevel. Each curve is pass rate vs. number of attempts.
  • Figure 4: Pass@$k$ results on Algorithmic Creativity. For validity pass@$k$ and creativity pass@$k$, the agent gets a pass if at least one of the $k$ attempts was a valid and creative triangle, respectively. In diff@$k$ evaluation, we evaluate the number of generations that were unique given $k$ attempts.
  • Figure 5: Example BabyAI environments and their missions.
  • ...and 1 more figure
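The pass@$k$ and diff@$k$ metrics named in the captions above can be sketched as follows. This is an illustration, not the paper's evaluation code: `pass_at_k` uses the standard combinatorial estimator of pass@$k$ from $n$ samples with $c$ passes, which the source does not confirm was the procedure used, and `diff_at_k` simply counts unique generations among the first $k$ attempts. Both function names are hypothetical.

```python
from math import comb
from typing import Hashable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts passes.

    n: total attempts sampled, c: attempts that passed, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the chance that a
    random size-k subset of the n attempts contains no passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def diff_at_k(generations: Sequence[Hashable], k: int) -> int:
    """Count the unique generations among the first k attempts."""
    return len(set(generations[:k]))


# Example: 10 attempts, 3 of which passed.
single_attempt = pass_at_k(n=10, c=3, k=1)
five_attempts = pass_at_k(n=10, c=3, k=5)
unique_count = diff_at_k(["abc", "abd", "abc", "bcd"], k=4)
```

A policy that has collapsed onto one strategy gains little from a larger $k$, while a policy that keeps several distinct strategies sees its pass rate and unique count grow with $k$, which is why these curves are used as a measure of coverage.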

Theorems & Definitions (9)

  • Definition 3.1
  • Lemma 3.2
  • Proposition 5.1
  • Definition 5.2
  • Proposition 5.3
  • Proposition 5.4
  • Definition 5.5
  • Lemma C.1
  • Lemma C.2