PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

Zhihao Lin, Lin Wu, Zhen Tian, Jianglin Lan

TL;DR

PrefPoE tackles the persistent exploration challenge in policy-gradient RL by learning where to explore. It introduces a two-headed policy with a shared encoder, where an advantage-guided preference head is fused with the main policy through a Product-of-Experts approach, yielding a soft trust region and focused exploration. Theoretical results show the preference converges to a Boltzmann distribution over normalized advantages, while PoE fusion reduces variance and enforces consensus, supporting stable, efficient learning. Empirically, PrefPoE delivers substantial improvements across continuous and discrete tasks with better stability and sample efficiency, while remaining a modular, plug-and-play augmentation for PPO and related methods.
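For discrete action spaces, the Product-of-Experts fusion described above can be illustrated with a small sketch. It assumes that both heads output logits over the same action set and that fusion is an unweighted, renormalized product of the two categorical distributions. The paper's exact fusion rule (for example, any per-expert weighting) is not given here, and the names `softmax` and `poe_fuse_categorical` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def poe_fuse_categorical(main_logits, pref_logits):
    """Unweighted PoE of two categoricals: p(a) is proportional to p_main(a) * p_pref(a).

    Adding logits is equivalent to multiplying probabilities before renormalizing.
    This is an illustrative assumption, not the paper's confirmed fusion rule.
    """
    fused_logits = [m + p for m, p in zip(main_logits, pref_logits)]
    return softmax(fused_logits)

# Hypothetical logits for a 4-action task.
main_logits = [1.0, 1.0, 0.0, -1.0]   # main policy head
pref_logits = [2.0, 0.0, 0.0, -2.0]   # advantage-guided preference head

p_main = softmax(main_logits)
p_pref = softmax(pref_logits)
p_fused = poe_fuse_categorical(main_logits, pref_logits)

for name, probs in [("main", p_main), ("pref", p_pref), ("fused", p_fused)]:
    print(name, [round(p, 3) for p in probs])
```

Because the product is large only where both experts assign non-negligible probability, an action that either head rules out is suppressed in the fused policy. This is the "consensus" behaviour the summary refers to.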

Abstract

Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 $\rightarrow$ 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
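For continuous actions, the variance-reducing effect of PoE fusion can be seen in the standard closed form for the product of two one-dimensional Gaussians. The sketch below assumes both heads output a diagonal Gaussian and that the experts are combined without extra weighting. The function name `poe_fuse_gaussian` and the numeric values are hypothetical, and the paper's actual parameterization may differ.

```python
import math

def poe_fuse_gaussian(mu_main, std_main, mu_pref, std_pref):
    """Product of two 1-D Gaussians, renormalized.

    The fused precision is the sum of the expert precisions, and the fused
    mean is the precision-weighted average of the expert means.
    """
    prec_main = 1.0 / std_main ** 2
    prec_pref = 1.0 / std_pref ** 2
    prec_fused = prec_main + prec_pref
    mu_fused = (prec_main * mu_main + prec_pref * mu_pref) / prec_fused
    std_fused = math.sqrt(1.0 / prec_fused)
    return mu_fused, std_fused

# Hypothetical per-dimension outputs of the two heads.
mu, std = poe_fuse_gaussian(mu_main=0.0, std_main=1.0, mu_pref=1.0, std_pref=0.5)
print(mu, std)
```

Since precisions add, the fused standard deviation is never larger than that of the more confident expert, and the fused mean is pulled toward it. Under these assumptions, that is one way to read the "soft trust region": sampled actions stay in the region both heads consider plausible.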

Paper Structure

This paper contains 50 sections, 2 theorems, 39 equations, 10 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Assume that the advantage function $A_\text{norm}(s,a)$ is bounded by $|A_\text{norm}(s,a)| \leq A_{\max}$ for all $(s,a)$, and let $\phi$ denote the parameters of the preference network $\pi_{\text{pref}}(a|s)$. Then the loss function (eq3:pref_loss) admits a unique global minimum given by the Boltzmann distribution $\pi_{\text{pref}}^{*}(a|s) = \exp(\beta_1 A_{\text{norm}}(s,a) / \alpha) / Z(s)$, where $Z(s) = \int \exp(\beta_1 A_{\text{norm}}(s,a) / \alpha) \mathrm{d}a$ is the partition function.
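The partition function in Theorem 1 normalizes a density proportional to $\exp(\beta_1 A_{\text{norm}}(s,a) / \alpha)$. As a rough numerical illustration, the sketch below evaluates the discrete-action analogue, with a sum in place of the integral. The function name `boltzmann_preference`, the advantage values, and the settings of `beta1` and `alpha` are hypothetical and not taken from the paper.

```python
import math

def boltzmann_preference(norm_advantages, beta1=1.0, alpha=0.5):
    """Discrete analogue of the Boltzmann target: p(a) proportional to exp(beta1 * A(a) / alpha)."""
    scaled = [beta1 * a / alpha for a in norm_advantages]
    m = max(scaled)                          # shift by the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)                         # discrete Z(s), up to the constant shift
    return [w / z for w in weights]

# Hypothetical normalized advantages for four actions in one state.
advantages = [1.2, 0.1, -0.4, -0.9]

target = boltzmann_preference(advantages)              # moderate temperature
sharper = boltzmann_preference(advantages, alpha=0.1)  # smaller alpha concentrates mass

print([round(p, 3) for p in target])
print([round(p, 3) for p in sharper])
```

In this form, a smaller $\alpha$ (or a larger $\beta_1$) concentrates the target on the highest-advantage action, while a larger $\alpha$ flattens it toward uniform. The bound $|A_\text{norm}| \leq A_{\max}$ keeps every exponent finite.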

Figures (10)

  • Figure 1: Architecture of PrefPoE. A shared backbone (a common encoder $f_{\text{enc}}(s)$) feeds into main and preference policy heads, which are fused via PoE to generate actions. The preference head is trained with advantage guidance to focus exploration on high-value regions.
  • Figure 2: Entropy dynamics in HalfCheetah-v4 demonstrating PrefPoE's resistance to entropy collapse. (a) Preference entropy shows initial focusing and later broadening. (b) PPO entropy decays monotonically, converging near 1.0. (c) PoE entropy shows adaptive U-shaped dynamics, recovering and stabilizing around 1.0 (1--2M), thereby preventing the collapse observed in vanilla PPO.
  • Figure 3: Learning curves on (a) LunarLanderContinuous-v2, (b) HalfCheetah-v4, and (c) Ant-v4.
  • Figure 4: Learning curves on (a) CartPole-v1, (b) LunarLander-v2, and (c) FrozenLake-v1.
  • Figure 5: Ablation study on HalfCheetah-v4: (a) Learning curves, (b) component-wise performance contributions, and (c) training stability heatmap.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof