Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Alireza Mousavi-Hosseini; Murat A. Erdogdu

TL;DR

Under the margin condition, SGD with an adaptive learning rate (LR) is proved to achieve a near-optimal test error for statistical learning, and policy gradient (PG) with an adaptive LR is proved to achieve a near-optimal number of mistakes for online learning while being computationally efficient whenever possible. Both results may be of independent interest.

Abstract

We study post-training linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in Y^N$, a sequence of length $N$ that satisfies a $\gamma$-margin condition, an extension of standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with an adaptive learning rate (LR) achieves a near-optimal test error for statistical learning, and PG with an adaptive LR achieves a near-optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.
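
As a rough intuition for why outcome rewards can cost a number of queries exponential in $N$ while token-level feedback need not, the toy calculation below may help. It is only a sketch under assumptions that are not taken from the paper: every token of the correct response is sampled independently with the same probability `p`, and the function names are hypothetical. It does not implement the paper's Likelihood Quantile or its bounds.

```python
def expected_queries_outcome(p: float, n: int) -> float:
    """Toy model: expected number of sampled responses until one is fully
    correct, when each of the n tokens is correct independently with
    probability p. The whole sequence is correct with probability p**n,
    so the wait is geometric with mean 1 / p**n."""
    return 1.0 / (p ** n)


def expected_queries_process(p: float, n: int) -> float:
    """Toy model: expected number of token-level checks when each position
    can be verified on its own and resampled until correct. Each position
    is a geometric wait with mean 1 / p, giving n / p in total."""
    return n / p


if __name__ == "__main__":
    # Assumed per-token likelihood of 0.5, chosen purely for illustration.
    for n in (1, 5, 10, 20):
        print(n, expected_queries_outcome(0.5, n), expected_queries_process(0.5, n))
```

In this toy setting the outcome-only count grows as $p^{-N}$ while the token-level count grows linearly in $N$, which mirrors, in a much cruder form, the contrast the abstract draws between outcome and process rewards.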

Paper Structure (8 sections, 1 equation, 1 figure)

Introduction
Our Contributions
Additional Related Works
Online Linear Classification with Bandit Feedback.
SGD for Separable Data.
Policy Gradient Analysis.
Notation.
Autoregressive Linear Models and Pre-Training

Figures (1)

Figure 1: The evolution of the model's likelihood of generating the correct response over different contexts throughout PG. (Left) The average likelihood over samples with initial likelihood $\approx 0$ under the base model. On-policy PG with a process reward model (PRM) is able to improve this average, while with an outcome reward model (ORM) it stays at $0$. (Center) A comparison of the expected test error with ORM, where the error plateaus at a threshold, and with PRM, where the error continues to decrease. (Right) The likelihood of individual samples throughout PG with ORM, where the color denotes initial likelihood. Experiment details are presented in the experiments section.
