PRISM: Parallel Reward Integration with Symmetry for MORL

Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He

TL;DR

A Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels, and introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives.

Abstract

This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at https://github.com/EVIEHub/PRISM.
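
To make the idea of a reflectional equivariance penalty concrete, the sketch below measures how far a policy is from satisfying $\pi(L_g s) = K_g \pi(s)$, using the $L_g$ (state reflection) and $K_g$ (action reflection) notation from the Figure 1 caption. It is not the paper's SymReg implementation: the squared-error form of the penalty, the toy 4-dimensional state and 2-dimensional action spaces, the permutation matrices, the linear policy, and all function names are assumptions chosen purely for illustration.

```python
import numpy as np

def equivariance_penalty(policy, states, L_g, K_g):
    """Mean squared gap between pi(L_g s) and K_g pi(s) over a batch.

    Hypothetical penalty form; the paper's SymReg may be defined differently.
    """
    actions = policy(states)                      # pi(s), shape (batch, action_dim)
    actions_on_reflected = policy(states @ L_g.T)  # pi(L_g s)
    reflected_actions = actions @ K_g.T           # K_g pi(s)
    gap = actions_on_reflected - reflected_actions
    return float(np.mean(np.sum(gap ** 2, axis=1)))

# Toy reflection (assumed): swap the "left" and "right" halves of the state,
# and swap the two action components. Both matrices are involutions.
L_g = np.array([[0, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)
K_g = np.array([[0, 1],
                [1, 0]], dtype=float)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))          # weights of an arbitrary linear policy
W_eq = 0.5 * (W + K_g @ W @ L_g)     # averaged over the reflection group {e, g}

def make_linear_policy(weights):
    # pi(s) = W s, applied row-wise to a batch of states
    return lambda states: states @ weights.T

states = rng.normal(size=(128, 4))
pen_raw = equivariance_penalty(make_linear_policy(W), states, L_g, K_g)
pen_eq = equivariance_penalty(make_linear_policy(W_eq), states, L_g, K_g)
```

In this toy setting the unconstrained weights `W` incur a clearly positive penalty, while the group-averaged weights `W_eq` satisfy $W_{\text{eq}} L_g = K_g W_{\text{eq}}$ and so drive the penalty to zero up to floating-point error. A regulariser of this kind would typically be added to the training loss with a weighting coefficient, which is how a policy can be nudged towards the reflection-equivariant subspace without hard-coding the symmetry into the architecture.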

Paper Structure (34 sections, 25 theorems, 72 equations, 7 figures, 13 tables, 1 algorithm)

Key Result

Theorem 5.8

The space $\Pi_{\text{eq}}$ has a covering number less than or equal to that of $\Pi$. Let $\mathcal{N}_{\infty,1}(\mathcal{F}, r)$ be the covering number of a function space $\mathcal{F}$ under the $l_{\infty,1}$-distance. Then, $\mathcal{N}_{\infty,1}(\Pi_{\text{eq}}, r) \le \mathcal{N}_{\infty,1}(\Pi, r)$.

Figures (7)

  • Figure 1: Reflectional symmetry in a two-legged agent. The left panel shows a transition from state $s$ to $s'$ under action $a$, whereas the right panel shows the reflected transition, where states and actions are transformed by $L_g$ and $K_g$, respectively.
  • Figure 2: Overview of ReSymNet.
  • Figure 3: The obtained hypervolume for various levels of sparsity amongst various dimensions.
  • Figure 4: The approximated Pareto front for dense rewards (blue dots) and sparse rewards (orange dots) for the first reward objective.
  • Figure 5: The dense (blue line) and shaped rewards (orange line) over time for mo-walker2d-v5 and the first reward objective.
  • ...and 2 more figures

Theorems & Definitions (47)

  • Definition 3.1: $l_{\infty,1}$ distance
  • Definition 3.2: covering number
  • Definition 3.3: Rademacher complexity
  • Remark 4.1
  • Remark 5.1
  • Definition 5.7: reflection-equivariant subspace
  • Theorem 5.8
  • Corollary 5.9
  • Theorem 5.10
  • Corollary 5.11
  • ...and 37 more