GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining

Simin Fan, Maria Ios Glarou, Martin Jaggi

TL;DR

GRAPE tackles robustness across multiple targets by jointly reweighting source-domain data and target-task importance through a minimax framework grounded in Group DRO. It introduces RoI to quantify per-task learning progress and derives online mirror-descent-style update rules for both task and domain weights, with guaranteed convergence to a Pareto-neighborhood at an $\mathcal{O}(1/T)$ rate. Empirically, GRAPE improves multi-task reasoning benchmarks on ClimbLab/SlimPajama and enhances low-resource language modeling on Wiki-40B, often using significantly fewer tokens than baselines. These results demonstrate the value of an adaptive curriculum over fixed data mixtures, highlighting gains in generalization and cross-lingual transfer for large-scale pretraining.

Abstract

The performance of large language models (LLMs) across diverse downstream applications is fundamentally governed by the quality and composition of their pretraining corpora. Existing domain reweighting algorithms primarily optimize data mixtures for a single target task, which yields models that overfit to specialized objectives and degrade substantially on other benchmarks. This paper introduces Group Robust Multi-target Adaptive PrEtraining (GRAPE), a novel multi-source-multi-target domain reweighting framework designed to calibrate pretraining data mixtures for robust performance across multiple target tasks simultaneously. GRAPE dynamically adjusts sampling weights across source domains (domain weights) while concurrently modulating task weights that quantify the relative importance of each individual target task. This adaptive process prioritizes tasks based on their learning difficulty throughout training. We formulate this interleaved reweighting mechanism as a minimax optimization problem: the inner maximization adjusts task weights using group distributionally robust optimization (DRO), so that the tasks showing the least improvement under the current data mixture receive higher weights; the outer minimization then optimizes domain weights to maximize loss reduction on the prioritized tasks. Experiments on the ClimbLab and SlimPajama datasets demonstrate that GRAPE consistently outperforms baseline methods in reasoning performance across 6 benchmarks. Furthermore, when applied to multilingual targets, GRAPE effectively identifies optimal training mixtures from mainstream languages, achieving superior language modeling capabilities across 8 low-resource target languages.
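The interleaved reweighting described above can be illustrated with a small sketch. It is not the paper's update rule: it assumes a plain exponentiated-gradient (entropic mirror descent) step on the probability simplex, uses relative loss decrease as a stand-in for per-task progress, and takes a hypothetical `gain` matrix whose entry `gain[i][k]` estimates how much training on domain `i` reduces the loss of task `k`. All function and parameter names are illustrative.

```python
import math


def mirror_step(weights, scores, step_size):
    """One exponentiated-gradient step: raise weights with high scores, renormalize."""
    # Assumes strictly positive weights so that log() is defined.
    logits = [math.log(w) + step_size * s for w, s in zip(weights, scores)]
    peak = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(x - peak) for x in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def update_task_weights(task_w, prev_losses, cur_losses, step_size):
    """Inner step (assumed form): tasks that improved least gain weight."""
    # Relative loss decrease is a stand-in progress measure, not the paper's definition.
    improvement = [(p - c) / p for p, c in zip(prev_losses, cur_losses)]
    return mirror_step(task_w, [-g for g in improvement], step_size)


def update_domain_weights(domain_w, task_w, gain, step_size):
    """Outer step (assumed form): favor domains that help the prioritized tasks."""
    # gain[i][k]: hypothetical estimate of loss reduction on task k from domain i.
    scores = [sum(z * g for z, g in zip(task_w, row)) for row in gain]
    return mirror_step(domain_w, scores, step_size)


if __name__ == "__main__":
    # Task 0 improved a lot (2.0 -> 1.0); task 1 barely moved (2.0 -> 1.8).
    task_w = update_task_weights([0.5, 0.5], [2.0, 2.0], [1.0, 1.8], step_size=1.0)
    gain = [[0.3, 0.0], [0.0, 0.3], [0.1, 0.1]]
    domain_w = update_domain_weights([1 / 3] * 3, task_w, gain, step_size=1.0)
    print("task weights:", task_w)
    print("domain weights:", domain_w)
```

In this toy run the slower-improving task receives the larger task weight, and the domain that mostly helps that task receives the largest sampling weight. In an actual training loop the gain estimates would have to come from the model (for example from gradient information), which this sketch leaves out.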

Paper Structure

This paper contains 60 sections, 4 theorems, 47 equations, 32 figures, 9 tables, 1 algorithm.

Key Result

Theorem 2.1

Let the loss functions $l_n(\boldsymbol{\theta})$ be $L$-smooth for all $n \in [N]$ and the norm of stochastic gradients be upper-bounded by $\mathcal{G}$. If the learning rate $\gamma_t$ satisfies $\gamma_t \leq \frac{1}{L}$ and the regularization parameters $\mu_{\boldsymbol{\alpha}}$, $\mu_{\bold

Figures (32)

  • Figure 1: GRAPE facilitates multi-task reasoning. For 125M models, GRAPE and GRAPE-climbmix greatly outperform five baselines; for larger 0.7B models, GRAPE achieves scores comparable to the uniform baseline with 40% fewer tokens.
  • Figure 2: Task weight evolution of GRAPE.
  • Figure 3: Domain weights attributions across 20 clusters in the ClimbLab dataset.
  • Figure 4: Low-resource language learning progress measured by log-perplexity. GRAPE significantly outperforms DoGE and Uniform sampling across all target languages.
  • Figure 5: Weights evolution on multilingual pretraining for low-resource language modeling.
  • ...and 27 more figures

Theorems & Definitions (7)

  • Theorem 2.1: Convergence of GRAPE
  • Theorem 2.2: Monotonic Variance Reduction of Task Performance
  • Theorem C.1: Convergence of GRAPE
  • proof
  • Theorem C.2: Monotonic Variance Reduction of Task Performance
  • proof
  • proof