Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Youngjae Cho, Jongsuk Kim, Ji-Hoon Kim

TL;DR

GAPO tackles the brittleness of offline preference alignment by replacing a fixed reference with a dynamic geometric anchor, measuring local stability via the Anchor Gap. By adversarially perturbing the current policy within a small radius and reweighting each preference signal accordingly, GAPO downweights brittle, noisy signals and emphasizes robust semantic preferences. The method preserves the gradient direction of standard objectives while adaptively scaling updates by instance-specific stability, with theoretical ties to local geometry and empirical improvements across instruction-following and reasoning benchmarks, including resilience to label noise. The practical impact lies in improved robustness and data valuation, with a controllable computational cost, suggesting geometry-aware stability as a general principle for reliable preference-based alignment.

Abstract

Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, however, a static reference can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
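To make the anchoring idea concrete, the following is a minimal sketch on a toy log-linear model, not the authors' implementation. It assumes a margin of the form $M_i(\theta) = \beta \langle \theta, x_i \rangle$, where $x_i$ is a hypothetical feature difference between the chosen and rejected responses; it builds the anchor with a single first-order adversarial step of length $\rho$, and it uses $\sigma(-\Gamma_i)$ as one illustrative choice of stability weight. The function names, the toy margin, and the weighting rule are all assumptions made for illustration.

```python
import numpy as np


def margin(theta, x_diff, beta=1.0):
    """Toy preference margin M_i(theta) = beta * <theta, x_i> (assumed form)."""
    return beta * (x_diff @ theta)


def anchor_gap(theta, x_diff, rho=0.1, beta=1.0):
    """Gap between the margin at theta and at a first-order adversarial anchor.

    The anchor for instance i is theta + eps_i, where eps_i is a step of
    length rho along the negative margin gradient (an assumption; an exact
    worst case would require solving the inner minimization).
    """
    grad = beta * x_diff                                # dM_i/dtheta, one row per instance
    norm = np.linalg.norm(grad, axis=1, keepdims=True)
    eps = -rho * grad / np.maximum(norm, 1e-12)         # per-instance perturbation
    anchored = beta * np.sum(x_diff * (theta + eps), axis=1)
    return margin(theta, x_diff, beta) - anchored


def stability_weights(gap):
    """Illustrative weight sigma(-gap): a larger gap (more brittle) gives a smaller weight."""
    return 1.0 / (1.0 + np.exp(gap))


theta = np.array([0.5, -0.2])
x_diff = np.array([[3.0, 4.0],     # steep margin: sensitive to perturbation
                   [0.3, 0.4]])    # flat margin: locally stable
gap = anchor_gap(theta, x_diff, rho=0.1)
weights = stability_weights(gap)
```

In this toy setting the gap reduces to $\rho \beta \|x_i\|_2$, so the first pair, whose margin changes quickly under small parameter perturbations, receives the smaller weight.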

Paper Structure (44 sections, 1 theorem, 20 equations, 3 figures, 11 tables, 1 algorithm)

Key Result

Theorem 5.1

Consider the ideal worst-case perturbation for an instance $i$, defined as $\epsilon_i^* = \arg\min_{\|\epsilon\|_2 \le \rho} M_i(\theta + \epsilon)$. Under the assumption that $M_i(\theta)$ is twice differentiable and $\rho$ is sufficiently small, the Anchor Gap $\Gamma_i(\theta)$ is approximated b
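For intuition, the standard first-order argument for an $\ell_2$-constrained worst-case perturbation can be worked out as follows. This is a sketch under two assumptions not confirmed by the text above: that the Anchor Gap takes the form $\Gamma_i(\theta) = M_i(\theta) - M_i(\theta + \epsilon_i^*)$, and that $\nabla_\theta M_i(\theta) \neq 0$. It is not the theorem's stated result, which may carry additional higher-order terms.

$$
\begin{aligned}
M_i(\theta + \epsilon) &\approx M_i(\theta) + \nabla_\theta M_i(\theta)^\top \epsilon, \\
\epsilon_i^* &\approx \arg\min_{\|\epsilon\|_2 \le \rho} \nabla_\theta M_i(\theta)^\top \epsilon = -\rho \, \frac{\nabla_\theta M_i(\theta)}{\|\nabla_\theta M_i(\theta)\|_2}, \\
M_i(\theta) - M_i(\theta + \epsilon_i^*) &\approx \rho \, \|\nabla_\theta M_i(\theta)\|_2 .
\end{aligned}
$$

Under these assumptions, instances whose margin has a large gradient norm at $\theta$ have a large gap to first order, which is consistent with reading the gap as a measure of local sensitivity.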

Figures (3)

  • Figure 1: Comparison of Implicit Reward Landscapes and Objectives. (a) presents the main objective functions. (b-d) illustrate the reward dynamics of each method. Note that the entire visualization has been rescaled for better fit.
  • Figure 2: Wall-clock training dynamics on Mistral-7B. All methods are trained for a single epoch. GAPO continues to improve during later optimization stages, while SimPO saturates after its standard training regime.
  • Figure 3: Comparison of model outputs on historical facts. The loser model (left) generates a hallucination regarding the first settlement date and location (1597 vs. 1497), whereas the winner model (right) provides historically accurate details.

Theorems & Definitions (3)

  • Theorem 5.1: Anchor Gap as Local Sharpness
  • proof
  • proof