Align and Filter: Improving Performance in Asynchronous On-Policy RL

Homayoun Honari; Roger Creus Castanyer; Michael Przystupa; Michael Noukhovitch; Pablo Samuel Castro; Glen Berseth

TL;DR

This paper identifies the sources of policy lag caused by distributed learning and high update frequency, proposes total Variation-based Advantage aligned Constrained policy Optimization (VACO) as a practical approach to mitigating it, and empirically validates that the method is more robust to policy lag.

Abstract

Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose total Variation-based Advantage aligned Constrained policy Optimization (VACO) as a practical approach to mitigating policy lag. We empirically validate our method and show that it is more robust to policy lag on classic RL tasks and on a modern RL-for-LLM math reasoning task.

Paper Structure (35 sections, 7 theorems, 37 equations, 12 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Background
Asynchronous On-Policy RL
Trust Region Policy Optimization
Proximal Policy Optimization
Methodology
PPO performance with off-policy data
Forward Policy Lag
Backward Policy Lag
Off-Policy Performance Difference
Advantage Realignment
Filtering-based Constrained Policy Optimization
Experiments
Backward Policy Lag in MuJoCo
...and 20 more sections

Key Result

Lemma 3.1

(Kakade and Langford, 2002) For any two policies $\pi$ and $\pi'$ and the initial state distribution $\mu$:
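The statement above is cut off in this listing. For orientation only, the result usually cited from Kakade and Langford (2002) is the performance difference lemma, which in its standard form reads as below. The symbols $\eta$, $\gamma$, $d_{\mu}^{\pi'}$ and $A^{\pi}$ are assumptions taken from the common textbook statement, not from this paper, whose notation may differ.

```latex
% Standard performance difference lemma (notation assumed, not taken from the paper).
% \eta(\pi) = E_{s_0 \sim \mu}[V^{\pi}(s_0)] is the expected discounted return,
% d_{\mu}^{\pi'} is the normalized discounted state-visitation distribution of \pi'
% started from \mu, and A^{\pi} is the advantage function of \pi.
\eta(\pi') - \eta(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d_{\mu}^{\pi'},\; a \sim \pi'(\cdot \mid s)}
    \left[ A^{\pi}(s, a) \right]
```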

Figures (12)

Figure 1: (left) Simulated asynchronous RL setup. At the end of each training phase, we store the weights of the new policy in the policy buffer. We then sample random policies from the buffer, assign them to the actors, and generate the trajectories synchronously. This setup simulates asynchronous RL training while letting us control the backward policy lag. (right) The typical asynchronous RL training setup. Trajectory generation by the actors and policy training run simultaneously. Each actor receives the new policy whenever it is ready, so the new dataset contains trajectories from older policies.
Figure 2: Algorithmic Choices in VACO. (Top) VACO vs. PPO Clipping: To maintain a fixed Total Variation (TV) divergence, VACO (shown as 'TV Filtering') selectively removes gradients that would increase the TV divergence. In contrast, PPO naively clips gradients if their policy ratio exceeds a predefined threshold. (Bottom) IMPALA vs. Advantage Realignment: While IMPALA estimates advantage values for the most recent policy using an asynchronously generated dataset, VACO's 'Advantage Realignment' first aligns the dataset to the initial policy of the optimization process, then iteratively optimizes based on this aligned dataset. This approach significantly reduces the computational load compared to IMPALA's on-the-fly realignment.
Figure 3: With more backward policy lag (a higher degree of asynchronicity), VACO achieves better performance on the aggregate metrics across various MuJoCo tasks. Higher Median, IQM, and Mean values and a lower Optimality Gap imply better performance. The scores are computed over 100M steps across 10 independent random seeds, and the shaded regions represent 95% confidence intervals of the metrics.
Figure 4: IQM values of the comparison algorithms in the simulated asynchronous setups during the training process. The scores are computed over 100M steps across 10 independent random seeds, and the shaded regions represent the 95% confidence interval. (bottom right) IQM values of the area under the curve of the normalized return plots. Higher values imply better sample efficiency during the training process.
Figure 5: VACO improves over PPO-clipping for training LLMs to reason on GSM8k. (Top) Forward policy lag can improve training efficiency at the cost of eval performance. VACO maintains higher performance as lag increases. (Bottom) PPO-clip is always clipping, proportional to the forward lag. VACO filters more rarely, enabling learning from highly policy-lagged samples, but still maintains stability by filtering a larger part of the batch when activated.
...and 7 more figures
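The contrast drawn in the Figure 2 and Figure 5 captions, per-sample ratio clipping versus filtering driven by a batch-level TV estimate, can be illustrated with a rough sketch. This is not the paper's algorithm: the filtering rule, the function names, and the thresholds `eps` and `delta` are all hypothetical, chosen only to show the general idea. The sketch assumes `ratio` holds the per-sample probability ratios of the learning policy to the behavior policy and that actions were sampled from the behavior policy, so that half the mean of `|ratio - 1|` estimates the TV divergence.

```python
import numpy as np

def ppo_clip_mask(ratio, adv, eps=0.2):
    """True where PPO's clipped surrogate still passes a gradient."""
    # min(r*A, clip(r, 1-eps, 1+eps)*A) has zero gradient when the ratio has
    # already moved past the clip range in the direction the advantage favors.
    clipped = ((ratio > 1 + eps) & (adv > 0)) | ((ratio < 1 - eps) & (adv < 0))
    return ~clipped

def tv_estimate(ratio):
    """Sample estimate of TV(behavior, learner) = 0.5 * E_behavior[|ratio - 1|]."""
    return 0.5 * np.mean(np.abs(ratio - 1.0))

def tv_filter_mask(ratio, adv, delta=0.1):
    """Hypothetical TV filter (an assumption, not the paper's exact rule).

    While the batch-level TV estimate is within delta, every sample is kept.
    Once it exceeds delta, samples whose surrogate gradient would push the
    ratio further from 1 are dropped.
    """
    if tv_estimate(ratio) <= delta:
        return np.ones_like(ratio, dtype=bool)
    pushes_away = ((ratio > 1) & (adv > 0)) | ((ratio < 1) & (adv < 0))
    return ~pushes_away

# One sample has a large ratio, but the batch as a whole is close to the
# behavior policy, so the two rules disagree on that sample.
ratio = np.array([1.3, 1.0, 1.0, 1.0])
adv = np.ones(4)
keep_ppo = ppo_clip_mask(ratio, adv)
keep_tv = tv_filter_mask(ratio, adv)
```

In this toy batch the per-sample rule drops the first sample while the batch-level rule keeps everything, because the TV estimate stays under `delta`. That loosely mirrors the Figure 5 caption's remark that filtering is triggered more rarely than clipping but removes a larger part of the batch when it is.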

Theorems & Definitions (7)

Lemma 3.1
Theorem 3.2
Lemma 4.1
Lemma 4.2
Lemma 2.1
Theorem 2.2
Theorem 2.3
