Oracle-Robust Online Alignment for Large Language Models

Zimeng Li, Mudit Gaur, Vaneet Aggarwal

TL;DR

This paper introduces a pointwise oracle uncertainty set for online LLM alignment, formulates an oracle-robust alignment objective as a worst-case optimization problem, and shows that, for log-linear policies, this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty.

Abstract

We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. Online LLM alignment is a bi-level reinforcement learning problem because data collection and policy updates are coupled. Recently, the problem has been reduced to a tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set for this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.
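To make the idea of a projected stochastic update concrete, the following Python sketch runs projected stochastic gradient descent on a toy pairwise-preference loss whose score is linear in the parameters, as it would be for a log-linear policy. It is not the paper's algorithm: the logistic pairwise loss, the Euclidean-ball constraint set, the $1/\sqrt{t}$ step size, and all function and variable names are assumptions chosen for illustration, and the robust sensitivity penalty is left out because its form is not given in this summary.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def project_l2_ball(theta, radius):
    """Euclidean projection of theta onto the ball {v : ||v||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta
    return theta * (radius / norm)


def projected_sgd(feature_diffs, labels, radius=5.0, step=0.1,
                  n_steps=2000, seed=0):
    """Projected SGD on a pairwise logistic loss (illustrative only).

    feature_diffs[i] plays the role of phi(x, y1) - phi(x, y2) (hypothetical
    features), so the pairwise score feature_diffs[i] @ theta is linear in theta.
    labels[i] is 1.0 if the first response was preferred, else 0.0.
    """
    rng = np.random.default_rng(seed)
    n, d = feature_diffs.shape
    theta = np.zeros(d)
    for t in range(n_steps):
        i = rng.integers(n)                      # sample one comparison
        score = feature_diffs[i] @ theta         # pairwise score
        # Gradient of the logistic loss for this single comparison.
        grad = (sigmoid(score) - labels[i]) * feature_diffs[i]
        # Gradient step with decaying step size, then projection.
        theta = project_l2_ball(theta - step / np.sqrt(t + 1) * grad, radius)
    return theta
```

The projection keeps every iterate inside a bounded parameter set, which is the usual setting in which stationarity guarantees for weakly convex objectives are stated; the paper's composite update would additionally handle the penalty term.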

Paper Structure

This paper contains 42 sections, 14 theorems, 177 equations, 1 algorithm.

Key Result

Theorem 4.1

Recall that $L^{\mathrm{SAIL}}(\theta)$ denotes the non-robust SAIL objective and $L^{W}_\rho(\theta)$ denotes the robust objective defined by the worst-case oracle in $U^{W}(P^\star,\rho)$. Define the pairwise score and the robust penalty. Then, under the nondegenerate-oracle and log-linear SAIL assumptions, the robust objective $L^{W}_\rho(\theta)$ admits an exact decomposition into $L^{\mathrm{SAIL}}(\theta)$ plus an explicit sensitivity penalty.
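
In symbols, the structure of this statement can be sketched as follows. The notation $L(\theta; P)$ for the loss evaluated under an oracle $P$ and the placeholder $\mathrm{Pen}_\rho(\theta)$ for the sensitivity penalty are assumptions introduced here for illustration; the exact expressions for the pairwise score and the penalty are not reproduced in this summary.

```latex
% Worst-case (robust) objective over the oracle uncertainty set:
L^{W}_\rho(\theta) \;=\; \sup_{P \in U^{W}(P^\star,\rho)} L(\theta; P),
% Shape of the decomposition (penalty left as a placeholder):
\qquad
L^{W}_\rho(\theta) \;=\; L^{\mathrm{SAIL}}(\theta) \;+\; \mathrm{Pen}_\rho(\theta).
```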

Theorems & Definitions (45)

  • Remark 3.1
  • Theorem 4.1: Decomposition of $L^{W}_\rho(\theta)$
  • Proof sketch of Theorem 4.1
  • Remark 4.2: Interpretation of the decomposition
  • Remark 4.3
  • Example 4.4
  • Remark 4.5
  • Remark 4.6
  • Theorem 4.7: Weak convexity of the robust penalty
  • Proof sketch of Theorem 4.7
  • ...and 35 more