GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining
Simin Fan, Maria Ios Glarou, Martin Jaggi
TL;DR
GRAPE tackles robustness across multiple targets by jointly reweighting source-domain data and target-task importance through a minimax framework grounded in Group DRO. It introduces RoI to quantify per-task learning progress and derives online mirror-descent-style update rules for both task and domain weights, with guaranteed convergence to a Pareto neighborhood at a rate of ${\mathcal{O}}(1/T)$. Empirically, GRAPE improves multi-task reasoning performance when pretraining on ClimbLab and SlimPajama and enhances low-resource language modeling on Wiki-40B, often using significantly fewer tokens than baselines. These results demonstrate the value of an adaptive curriculum over fixed data mixtures, highlighting gains in generalization and cross-lingual transfer for large-scale pretraining.
Abstract
The performance of large language models (LLMs) across diverse downstream applications is fundamentally governed by the quality and composition of their pretraining corpora. Existing domain reweighting algorithms primarily optimize data mixtures for a single target task, resulting in models that overfit to specialized objectives while exhibiting substantial performance degradation on other benchmarks. This paper introduces Group Robust Multi-target Adaptive PrEtraining (GRAPE), a novel multi-source-multi-target domain reweighting framework designed to calibrate pretraining data mixtures for robust performance across multiple target tasks simultaneously. GRAPE dynamically adjusts sampling weights across source domains (domain weights) while concurrently modulating task weights that quantify the relative importance of each individual target task. This adaptive process prioritizes tasks based on their learning difficulty throughout training. We formulate this interleaved reweighting mechanism as a minimax optimization problem: the inner maximization adjusts task weights using group distributionally robust optimization (DRO), so that tasks showing the least improvement under the current data mixture receive higher weights; the outer minimization then optimizes domain weights to maximize loss reduction on the prioritized tasks. Experiments on the ClimbLab and SlimPajama datasets demonstrate that GRAPE consistently outperforms baseline methods in reasoning performance across 6 benchmarks. Furthermore, when applied to multilingual targets, GRAPE effectively identifies optimal training mixtures from mainstream languages, achieving superior language modeling capabilities across 8 low-resource target languages.
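The interleaved reweighting described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact update rule: the exponentiated-gradient form, the step sizes, and the names `update_task_weights`, `update_domain_weights`, `improvement`, and `reduction` are all assumptions made for illustration, and the numbers are toy values rather than results.

```python
import numpy as np


def normalize(w):
    """Project positive weights back onto the probability simplex."""
    return w / w.sum()


def update_task_weights(z, improvement, step=1.0):
    """Inner (max) step, assumed exponentiated-gradient form.

    improvement[n] is a hypothetical per-task progress signal;
    tasks with the LEAST improvement receive more weight.
    """
    return normalize(z * np.exp(-step * improvement))


def update_domain_weights(alpha, reduction, z, step=1.0):
    """Outer (min) step, assumed exponentiated-gradient form.

    reduction[k, n] is a hypothetical estimate of how much training on
    source domain k reduces the loss of target task n. Domains that help
    the currently prioritized tasks are sampled more often.
    """
    score = reduction @ z  # task-weighted loss reduction per domain
    return normalize(alpha * np.exp(step * score))


# Toy setup (illustrative values only): 3 source domains, 2 target tasks.
alpha = np.full(3, 1.0 / 3.0)            # uniform domain weights
z = np.full(2, 0.5)                      # uniform task weights
improvement = np.array([0.30, 0.05])     # task 1 is lagging behind
reduction = np.array([
    [0.4, 0.0],   # domain 0 helps task 0 only
    [0.1, 0.1],   # domain 1 helps both a little
    [0.0, 0.5],   # domain 2 helps task 1 only
])

z = update_task_weights(z, improvement)
alpha = update_domain_weights(alpha, reduction, z)
print("task weights:  ", z)
print("domain weights:", alpha)
```

In this toy run the lagging task gains weight, which in turn shifts sampling toward the domain that most reduces its loss. In actual training both signals would be re-estimated periodically from model losses and gradients instead of being fixed arrays.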
