Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li
TL;DR
This work tackles the challenge of predicting downstream task performance during LLM pretraining in the presence of emergence and task-difficulty heterogeneity. It introduces Clustering-On-Difficulty (COD), a multi-stage framework that groups tasks into difficulty-based clusters, fits a scaling law to each cluster, extrapolates to larger models using only the clusters that can be extrapolated reliably, and maps the resulting subset prediction to the full task set. On a 70B-parameter model evaluated across eight benchmarks, COD achieves a 1.36% average prediction error, substantially outperforming prior loss-intermediate and end-to-end methods. By combining a predictable subset with a robust subset-to-full mapping, the framework supports more accurate pretraining resource allocation and monitoring. The limitations it acknowledges, related to MoE architectures and dataset properties, are left for future work.
Abstract
The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for efficient resource allocation. This is challenged by: 1) the emergence phenomenon, where metrics become meaningful only after extensive training, hindering prediction by smaller models; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby establishing a more stable and predictable support subset through the exclusion of tasks exhibiting non-emergent behavior or irregular scaling. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.36% average prediction error across eight key LLM benchmarks, offering actionable insights for resource allocation and training monitoring of LLM pretraining.
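The four stages named in the abstract (cluster, fit, extrapolate, map) can be illustrated with a toy sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: tasks are binned by mean accuracy as a crude stand-in for clustering on difficulty scaling features, the cluster-wise law is assumed to have the form -log(acc) = a * C^(-b), a cluster counts as extrapolatable only if its accuracy rises strictly with compute, and the subset-to-full mapping is a plain linear fit. The function names are hypothetical.

```python
import numpy as np

def cluster_by_difficulty(acc, n_clusters):
    """Bin tasks by mean accuracy over the small models (assumed stand-in
    for difficulty clustering). acc has shape (n_models, n_tasks)."""
    order = np.argsort(acc.mean(axis=0))
    return np.array_split(order, n_clusters)

def fit_cluster_law(compute, cluster_acc):
    """Fit the assumed law -log(acc) = a * C**(-b) by a linear
    least-squares fit of log(-log(acc)) against log(C)."""
    y = np.log(-np.log(cluster_acc))
    slope, intercept = np.polyfit(np.log(compute), y, 1)
    return np.exp(intercept), -slope  # a, b

def predict_cluster(a, b, compute):
    """Evaluate the assumed law at a given compute budget."""
    return np.exp(-a * compute ** (-b))

def cod_sketch(compute, acc, target_compute, n_clusters=4):
    """Return (subset prediction, full-set prediction) at target_compute."""
    compute = np.asarray(compute, dtype=float)
    acc = np.asarray(acc, dtype=float)
    kept, preds = [], []
    for idx in cluster_by_difficulty(acc, n_clusters):
        # Cluster-level accuracy per small model, clipped so the logs are finite.
        mean_acc = np.clip(acc[:, idx].mean(axis=1), 1e-6, 1 - 1e-6)
        a, b = fit_cluster_law(compute, mean_acc)
        # Assumed filter: keep clusters that improve strictly with compute.
        if b > 0 and np.all(np.diff(mean_acc) > 0):
            kept.append(idx)
            preds.append(predict_cluster(a, b, target_compute))
    if not kept:
        raise ValueError("no extrapolatable cluster found")
    # Size-weighted average of cluster predictions gives the subset prediction.
    sizes = np.array([len(idx) for idx in kept])
    subset_pred = float(np.average(preds, weights=sizes))
    # Assumed mapping: linear fit of full-set accuracy on subset accuracy,
    # estimated from the small models, then applied to the subset prediction.
    subset_small = acc[:, np.concatenate(kept)].mean(axis=1)
    m, c = np.polyfit(subset_small, acc.mean(axis=1), 1)
    return subset_pred, float(m * subset_pred + c)
```

Calling `cod_sketch(compute, acc, target_compute)` with per-task accuracies from a few small models returns both the predictable-subset estimate and the mapped full-set estimate. Clusters that are flat or erratic are excluded from the fit but still influence the final number through the mapping step.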
