Two-Stage Optimizer-Aware Online Data Selection for Large Language Models

Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu

Abstract

Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.
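The two-stage Filter-then-Weight idea described above can be illustrated with a toy sketch. This is not the paper's implementation: it assumes plain SGD geometry (no optimizer preconditioning), hypothetical per-sample gradients `G` and a hypothetical target update direction `v`, filters candidates by cosine alignment with the target, and then fits coefficients by least squares so the weighted gradient combination matches the target update in squared $\ell_2$ norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 16, 8, 4                    # dim, candidates, kept subset size
v = rng.normal(size=d)                # hypothetical target update direction
G = rng.normal(size=(n, d))           # hypothetical per-sample gradients

# Stage 1 (Filter): keep the k candidates best aligned with the target,
# measured by cosine similarity between each gradient and v.
cos = (G @ v) / (np.linalg.norm(G, axis=1) * np.linalg.norm(v))
keep = np.argsort(cos)[-k:]
S = G[keep]                           # (k, d) filtered gradients

# Stage 2 (Weight): coefficients w minimizing ||v - S^T w||_2^2, so the
# weighted sum of kept gradients best matches the target direction.
w, *_ = np.linalg.lstsq(S.T, v, rcond=None)
update = S.T @ w

print("kept:", keep, "residual:", np.linalg.norm(v - update))
```

The least-squares step is what makes the construction subset-level rather than per-sample: redundant gradients in `S` receive smaller coefficients automatically, whereas independent per-sample ranking would count them twice.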

Paper Structure

This paper contains 36 sections, 1 theorem, 45 equations, 2 figures, 6 tables, 2 algorithms.

Key Result

Theorem B.1

Let $v \in \mathbb{R}^d$ be fixed, and let $h:\mathcal{W}\to\mathbb{R}^d$ be any mapping from a feasible set $\mathcal{W}$ to $\mathbb{R}^d$.
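Only the setup of the statement survives in this rendering. As a hedged sketch (not the paper's statement), the standard identity that a result with this title presumably builds on is:

```latex
\|v - h(w)\|_2^2 \;=\; \|v\|_2^2 \;-\; 2\langle v, h(w)\rangle \;+\; \|h(w)\|_2^2 .
```

Hence, whenever $\|h(w)\|_2$ is constant over the feasible set $\mathcal{W}$, minimizing the squared $\ell_2$ matching error $\|v - h(w)\|_2^2$ is equivalent to maximizing the inner product $\langle v, h(w)\rangle$; normalizing $h(w)$ to unit norm recovers cosine maximization as a special case.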

Figures (2)

  • Figure 1: TyDiQA performance (F1) as a function of the training data ratio.
  • Figure 2: Preliminary ablation (single seed) used to prune the design space of our proposed framework.

Theorems & Definitions (2)

  • Theorem B.1: Inner-product and cosine maximization as squared $\ell_2$ matching
  • Proof of Theorem B.1