Policy Optimization via Adv2: Adversarial Learning on Advantage Functions

Matthieu Jonckheere; Chiara Mignacco; Gilles Stoltz

TL;DR

This paper analyzes policy optimization in episodic adversarial MDPs by formalizing a reduction to adversarial learning that can use a broad class of strategies beyond exponential weights, and by leveraging advantage functions (or $Q$-values) as inputs to these learners. It introduces Adv2, a stage-wise adversarial-learning scheme that yields regret bounds of the form $H(H+1) B_{T,A}$ when the underlying learner controls $A$-dimensional adversarial regret $B_{T,A}$. The authors present three key extensions: convergence of the last iterate under monotone-weight strategies, stronger regret notions (strongly adaptive and tracking regret) with corresponding transfers to value-function regret, and aggregation (or orchestration) of expert policies via a lifted MDP framework. They also analyze the special case of exponential weights with improved regret bounds in the episodic setting and discuss practical implications for settings with unknown transition kernels and estimated value functions, outlining directions for empirical validation of alternative policy-improvement steps. Overall, the work broadens the toolkit for adversarial MDP policy optimization and provides a roadmap for leveraging a variety of online-learning strategies while preserving theoretical guarantees, with practical implications for whether to use $Q$-values or advantage functions and how to incorporate policy aggregation.

Abstract

We revisit the reduction of learning in adversarial Markov decision processes (MDPs) to adversarial learning based on $Q$-values; this reduction has been considered in a number of recent articles as one building block to perform policy optimization. Namely, we first consider and extend this reduction in an ideal setting where an oracle provides value functions: it may involve any adversarial learning strategy (not just exponential weights) and it may be based indifferently on $Q$-values or on advantage functions. We then present two extensions: on the one hand, convergence of the last iterate for a vast class of adversarial learning strategies (again, not just exponential weights), satisfying a property called monotonicity of weights; on the other hand, stronger regret criteria for learning in MDPs, inherited from the stronger regret criteria of adversarial learning called strongly adaptive regret and tracking regret. Third, we demonstrate how adversarial learning, also referred to as aggregation of experts, relates to aggregation (orchestration) of expert policies: we obtain stronger forms of performance guarantees in this setting than existing ones, via yet another, simple reduction. Finally, we discuss the impact of the reduction of learning in adversarial MDPs to adversarial learning in the practical scenarios where transition kernels are unknown and value functions must be learned. In particular, we review the literature and note that many strategies for policy optimization feature a policy-improvement step based on exponential weights with estimated $Q$-values. Our main message is that this step may be replaced by the application of any adversarial learning strategy on estimated $Q$-values or on estimated advantage functions. We leave the empirical evaluation of these twists for future research.
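To make the reduction described in the abstract more concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes the ideal setting with an oracle (here, exact backward induction), a known transition kernel that is the same at every stage and episode, and one adversarial learner per stage and state. Exponential weights is used only as one possible learner; any other strategy could be plugged in at the marked line. The names `exp_weights`, `evaluate` and `run_reduction`, the array layout, and the learning rate `eta` are all hypothetical choices made for illustration.

```python
import numpy as np

def exp_weights(cum_gains, eta):
    """Exponential weights over the last axis (one example of an adversarial learner)."""
    z = eta * (cum_gains - cum_gains.max(axis=-1, keepdims=True))  # shift for stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def evaluate(P, r, pi):
    """Oracle by backward induction (assumes a known kernel P).

    P[s, a, s'] : transition probabilities, r[h, s, a] : rewards of one episode,
    pi[h, s, a] : policy. Returns Q[h, s, a] and V[h, s].
    """
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0 closes the recursion
    for h in reversed(range(H)):
        Q[h] = r[h] + P @ V[h + 1]          # (S, A, S) @ (S,) -> (S, A)
        V[h] = (pi[h] * Q[h]).sum(axis=-1)  # expectation over actions
    return Q, V[:H]

def run_reduction(P, rewards, eta, use_advantage=True):
    """Feed Q-values or advantages of the current policy to per-(stage, state) learners."""
    T, H, S, A = rewards.shape
    cum = np.zeros((H, S, A))         # cumulative inputs seen by each learner
    pi = np.full((H, S, A), 1.0 / A)  # start from the uniform policy
    values = []
    for t in range(T):
        Q, V = evaluate(P, rewards[t], pi)
        values.append(V[0].copy())
        signal = Q - V[..., None] if use_advantage else Q
        cum += signal
        pi = exp_weights(cum, eta)    # swap in any other adversarial learner here
    return pi, np.array(values)

# Usage on a tiny random MDP with arbitrary reward sequences.
rng = np.random.default_rng(0)
T, H, S, A = 50, 3, 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
rewards = rng.uniform(size=(T, H, S, A))
pi_final, values = run_reduction(P, rewards, eta=0.3)
```

With exponential weights, the two choices of input lead to the same policies, because advantages differ from $Q$-values only by a shift that depends on the stage and state but not on the action. For learners that are not invariant to such shifts, the choice may matter.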

Paper Structure (49 sections, 12 theorems, 83 equations, 1 figure)


Introduction
Brief literature review
Adversarial MDPs / Reduction to adversarial learning.
Policy optimization.
A single adversarial-learning strategy, based on exponential weights.
Previous reductions of learning in MDPs to adversarial learning.
Contributions and outline of this article
Extension 1: convergence of the last iterate.
Extension 2: Stronger forms of regret.
The special case of exponential weights.
Extension 3: Aggregation (orchestration) of expert policies.
Empirical impacts as future research directions.
Setting and aims
Notation.
Setting.
...and 34 more sections

Key Result

Theorem 1

In the setting of the section "Setting and aims", where rewards lie in $[0,1]$, if, for all $h \in [H]$, the sequential strategies $\varphi_h$ control the regret in the adversarial setting (Definition 1) by $B_{T,A}$ for $A$-dimensional reward vectors bounded by $H-h+1$, then the $(\varphi_h)_{h \in [H]}$ [...]
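The step that typically connects such a statement to adversarial learning is the performance difference lemma (listed as Lemma 1 in this paper). A sketch of the standard episodic form follows. The notation is an assumption made for illustration and may differ from the paper's: $\pi_t$ is the policy played in episode $t$, $\pi^\star$ is a fixed comparison policy, $Q_{t,h}^{\pi_t}$, $V_{t,h}^{\pi_t}$ and $A_{t,h}^{\pi_t} = Q_{t,h}^{\pi_t} - V_{t,h}^{\pi_t}$ are computed with the rewards of episode $t$, and the transition kernel is assumed to be the same in all episodes, so that the law of the states $s_h$ visited under $\pi^\star$ does not depend on $t$.

```latex
% Performance difference lemma, episode t (standard form; notation assumed):
V_{t,1}^{\pi^\star}(s_1) - V_{t,1}^{\pi_t}(s_1)
  = \sum_{h=1}^{H} \mathbb{E}_{\pi^\star}\!\left[
      \sum_{a} \bigl( \pi^\star_h(a \mid s_h) - \pi_{t,h}(a \mid s_h) \bigr)\,
      Q_{t,h}^{\pi_t}(s_h, a) \right]
  = \sum_{h=1}^{H} \mathbb{E}_{\pi^\star}\!\left[
      \sum_{a} \pi^\star_h(a \mid s_h)\, A_{t,h}^{\pi_t}(s_h, a) \right].

% Summing over episodes and exchanging the sum over t with the expectation:
\sum_{t=1}^{T} \Bigl( V_{t,1}^{\pi^\star}(s_1) - V_{t,1}^{\pi_t}(s_1) \Bigr)
  = \sum_{h=1}^{H} \mathbb{E}_{\pi^\star}\!\left[
      \sum_{t=1}^{T} \sum_{a}
      \bigl( \pi^\star_h(a \mid s_h) - \pi_{t,h}(a \mid s_h) \bigr)\,
      Q_{t,h}^{\pi_t}(s_h, a) \right].
```

The second equality in the first display uses $\sum_a \pi_{t,h}(a \mid s) \, Q_{t,h}^{\pi_t}(s,a) = V_{t,h}^{\pi_t}(s)$, which is why $Q$-values and advantage functions can be used interchangeably in this identity. For each stage $h$ and state $s_h$, the inner sum over $t$ in the second display is the regret of an adversarial learner fed with the vectors $Q_{t,h}^{\pi_t}(s_h, \cdot)$, whose entries lie in $[0, H-h+1]$ when rewards lie in $[0,1]$.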

Figures (1)

Figure 1: The strategy considered and studied by Shani20, as stated therein (left part). Our results consider alternative formulations of the policy-improvement step, based on adversarial-learning strategies other than exponential weights, and possibly on estimated advantage functions rather than estimated $Q$-values (right part). Shani20 considers costs instead of rewards, hence the negative signs that appear when feeding adversarial-learning strategies $\varphi$ designed for rewards.
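As a small illustration of the sign convention mentioned in the caption, the following Python sketch applies one exponential-weights policy-improvement step at a single state, with costs negated before they are passed to a learner designed for rewards. It assumes the step takes the usual multiplicative form; the function name `exp_weights_step` and all numerical values are made up for the example and are not taken from Shani20 or from the paper.

```python
import numpy as np

def exp_weights_step(pi, gains, eta):
    """One multiplicative update: new pi(a) proportional to pi(a) * exp(eta * gains(a))."""
    logits = np.log(pi) + eta * gains
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical current policy and estimated Q-values at one state, expressed as costs.
pi = np.array([0.5, 0.3, 0.2])
q_cost = np.array([1.0, 2.5, 0.7])
v_cost = pi @ q_cost         # value of the state under pi
adv_cost = q_cost - v_cost   # estimated advantages (as costs)

# The learner is designed for rewards, so costs enter with a negative sign.
from_q = exp_weights_step(pi, -q_cost, eta=0.1)
from_adv = exp_weights_step(pi, -adv_cost, eta=0.1)
```

Here `from_q` and `from_adv` coincide up to rounding, since exponential weights is invariant to adding the same constant to all components of its input. A learner without this invariance could produce different updates from the two inputs.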

Theorems & Definitions (31)

Definition 1: adversarial-learning regret bound
Example 1
Example 2
Example 3
Theorem 1
Lemma 1: Performance difference lemma
Proof of Theorem 1
Remark 1
Definition 2: monotonicity of weights
Lemma 2
...and 21 more
