Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning

Yudan Wang, Peiyao Xiao, Hao Ban, Kaiyi Ji, Shaofeng Zou

TL;DR

This work tackles gradient conflicts in single-policy multi-task reinforcement learning by introducing MTAC, a dynamic-weighted actor-critic framework with two task-weight-update modes: CA (conflict-avoidant) and FC (fast-convergence). The authors establish finite-time convergence guarantees, showing MTAC-CA achieves an $\epsilon+\epsilon_{\text{app}}$-Pareto stationary policy in $\mathcal{O}(\epsilon^{-5})$ samples with an $\epsilon+\sqrt{\epsilon_{\text{app}}}$ CA distance, while MTAC-FC reduces this to $\mathcal{O}(\epsilon^{-3})$ samples at the cost of a constant CA distance. The analysis handles biased gradient estimates from function approximation by introducing a surrogate CA direction, enabling decomposition of CA-gap into critic- and approximation-related errors. Empirical results on the MT10 benchmark show MTAC-CA outperforms fixed-priority MTRL baselines, validating the practical benefits of dynamic weighting for multi-task RL.

Abstract

Multi-task reinforcement learning (MTRL) has shown great promise in many real-world applications. Existing MTRL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously under a given prior preference (or weights) over the tasks. However, these methods often suffer from gradient conflict: tasks with larger gradients dominate the update direction, which degrades performance on the other tasks. In this paper, we develop a novel dynamic-weighting multi-task actor-critic algorithm (MTAC) with two options for the task-weight update sub-procedure, named CA and FC. MTAC-CA aims to find a conflict-avoidant (CA) update direction that maximizes the minimum value improvement among tasks, and MTAC-FC targets a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MTAC-CA can find an $\epsilon+\epsilon_{\text{app}}$-accurate Pareto stationary policy using $\mathcal{O}(\epsilon^{-5})$ samples, while ensuring a small $\epsilon+\sqrt{\epsilon_{\text{app}}}$-level CA distance (defined as the distance to the CA direction), where $\epsilon_{\text{app}}$ is the function approximation error. The analysis also shows that MTAC-FC improves the sample complexity to $\mathcal{O}(\epsilon^{-3})$, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MTRL methods with fixed preference.
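To make the notion of gradient conflict concrete, the following minimal sketch compares a fixed-preference update with a conflict-avoidant one for two tasks. It assumes the standard min-norm (MGDA-style) characterization of the direction that maximizes the minimum first-order improvement; the section itself does not state the exact optimization problem, so this formulation, the toy gradients `g1` and `g2`, and the helper name `min_norm_weight_two_tasks` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def min_norm_weight_two_tasks(g1, g2):
    """Weight lam in [0, 1] minimizing ||lam * g1 + (1 - lam) * g2||^2.

    Hypothetical helper: closed-form solution of the two-task min-norm
    problem, used here as a stand-in for a conflict-avoidant weighting.
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: any weight gives the same direction
        return 0.5
    lam = float((g2 - g1) @ g2) / denom
    return min(1.0, max(0.0, lam))  # clip to the valid weight range

# Toy policy gradients (assumed values): task 1 has a much larger gradient.
g1 = np.array([4.0, 0.0])
g2 = np.array([-1.0, 1.0])

# Fixed equal preference: the update is dominated by task 1.
fixed_dir = 0.5 * g1 + 0.5 * g2
fixed_gains = (float(fixed_dir @ g1), float(fixed_dir @ g2))

# Dynamic weighting via the min-norm weight.
lam = min_norm_weight_two_tasks(g1, g2)
ca_dir = lam * g1 + (1.0 - lam) * g2
ca_gains = (float(ca_dir @ g1), float(ca_dir @ g2))

print("fixed-weight first-order gains:", fixed_gains)
print("min-norm weight on task 1:", lam)
print("conflict-avoidant first-order gains:", ca_gains)
```

In this toy case the fixed-weight direction has a negative inner product with the second task's gradient, so that task is expected to get worse to first order, whereas the min-norm direction has a positive inner product with both gradients.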

Paper Structure

This paper contains 21 sections, 14 theorems, 93 equations, 2 tables, and 3 algorithms.

Key Result

Proposition 1

Suppose Assumptions ass:smoothandlip and asm:ergodic hold. Choose the step size $c_{t,i}=\frac{c}{\sqrt{i}}$, where $c>0$ is a constant and $i$ indexes the iterations of the $\lambda_{t,i}$ update. Then the CA distance is bounded as: ... where $\widehat{\nabla} J^k_{w_{t+1}}(\theta_t) = \mathbb{E}_{d^k_{\theta_t}}[\phi^k(s,a)^\top w^k_{t+1}\psi_{\theta_t}(s,a)]$, $\widehat{\nabla} J_{\boldsymbol
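The role of the step size $c_{t,i}=\frac{c}{\sqrt{i}}$ can be illustrated with a small sketch of a projected-gradient update of the task weights on the probability simplex. This is only an assumption-laden illustration: it uses exact, deterministic task gradients and the min-norm objective $\frac{1}{2}\|\lambda^\top G\|^2$, whereas the proposition concerns estimated gradients built from critic parameters. The function names, the matrix `G`, and the values of `c` and `num_iters` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def update_task_weights(G, c=0.1, num_iters=200):
    """Projected gradient descent on 0.5 * ||lam^T G||^2 over the simplex.

    G has one row per task (assumed exact gradients, not critic estimates).
    The step size at inner iteration i is c / sqrt(i).
    """
    num_tasks = G.shape[0]
    lam = np.full(num_tasks, 1.0 / num_tasks)  # start from uniform weights
    gram = G @ G.T                             # pairwise gradient inner products
    for i in range(1, num_iters + 1):
        step = c / np.sqrt(i)
        lam = project_to_simplex(lam - step * (gram @ lam))
    return lam

# Two toy task gradients (assumed values) that conflict with each other.
G = np.array([[4.0, 0.0],
              [-1.0, 1.0]])
lam = update_task_weights(G)
direction = lam @ G

print("task weights:", lam)
print("update direction:", direction)
print("per-task first-order gains:", G @ direction)
```

With these toy gradients the weights settle at an interior point of the simplex where both tasks receive the same positive first-order gain; in a stochastic setting the decaying step size trades off progress against gradient noise.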

Theorems & Definitions (22)

  • Definition 1
  • Definition 2
  • Definition 3: Function Approximation Error
  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Proposition 2: Lipschitz property [xu2020improving]
  • Lemma 1
  • ...and 12 more