Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

TL;DR

This work addresses the problem of predicting downstream task performance during LLM pretraining when tasks show emergence and widely varying difficulty. It introduces Clustering-On-Difficulty (COD), a multi-stage framework that clusters tasks by difficulty, fits a scaling law to each cluster, extrapolates to larger models only on the extrapolatable clusters, and maps the resulting subset prediction to the full task set. On a 70B-parameter model evaluated across eight benchmarks, COD achieves a 1.36% average prediction error, substantially outperforming prior loss-intermediate and end-to-end methods. The framework supports more accurate pretraining resource allocation and monitoring by relying on a predictable subset and a subset-to-full mapping. The authors leave MoE architectures and the influence of dataset properties to future work.

Abstract

The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for efficient resource allocation. This is challenged by: 1) the emergence phenomenon, where metrics become meaningful only after extensive training, hindering prediction by smaller models; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby establishing a more stable and predictable support subset through the exclusion of tasks exhibiting non-emergent behavior or irregular scaling. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.36% average prediction error across eight key LLM benchmarks, offering actionable insights for resource allocation and training monitoring in LLM pretraining.
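
To make the clustering step concrete, here is a minimal sketch. It assumes each task is represented by a vector of pass rates measured on a series of smaller models (the figure captions on this page describe such a task-wise pass-rate vector), and it uses a plain flat-kernel mean shift, not the paper's Improved-MeanShift variant. The function names, the bandwidth, the minimum cluster size, and the data are all hypothetical.

```python
import math

def mean_shift(points, bandwidth, iters=50, tol=1e-6):
    """Flat-kernel mean shift: move each mode to the mean of nearby points."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        moved = 0.0
        for i, m in enumerate(modes):
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            if not near:  # defensive guard; keep the mode where it is
                continue
            new = [sum(col) / len(near) for col in zip(*near)]
            moved = max(moved, math.dist(m, new))
            modes[i] = new
        if moved < tol:
            break
    return modes

def label_modes(modes, merge_radius):
    """Give the same label to modes that ended up within merge_radius."""
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) <= merge_radius:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

def cluster_tasks(passrates, bandwidth=0.2, min_size=2):
    """Cluster pass-rate vectors; clusters below min_size are outliers (-1)."""
    modes = mean_shift(passrates, bandwidth)
    labels = label_modes(modes, bandwidth / 2)
    counts = {k: labels.count(k) for k in set(labels)}
    return [k if counts[k] >= min_size else -1 for k in labels]

# Illustrative pass rates on four increasingly large small models (made up).
passrates = [
    [0.60, 0.70, 0.80, 0.90],  # steadily improving
    [0.62, 0.72, 0.79, 0.91],
    [0.58, 0.69, 0.82, 0.88],
    [0.00, 0.05, 0.20, 0.40],  # late-improving
    [0.02, 0.04, 0.22, 0.38],
    [0.90, 0.10, 0.80, 0.05],  # irregular scaling
]
labels = cluster_tasks(passrates)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Tasks labeled -1 correspond to the irregular-scaling samples that a filtering step would exclude before fitting any per-cluster curve.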

Paper Structure

This paper contains 33 sections, 3 theorems, 21 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Consider a language model $M_C$ trained with compute budget $C$ and a set of downstream tasks $\mathcal{P}$. Under the following assumptions: Assumption 1 (Power-law scaling of answer loss): the expected answer loss follows: where $\alpha, \beta, \gamma > 0$ are task-specific constants, with $\gamma$ representing the irreducible loss. Assumption 2 (Unique deterministic answers): Each question has
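
Assumption 1 describes a power law with positive constants $\alpha$, $\beta$, $\gamma$, where $\gamma$ is the irreducible loss. One form consistent with that description, assumed here rather than quoted from the paper, is $L(C) = \alpha C^{-\beta} + \gamma$. The sketch below fits that assumed form by scanning candidate values of $\gamma$ and running a linear regression in log space for each one. The function name, the grid size, and the synthetic data are hypothetical.

```python
import math

def fit_power_law(compute, loss, grid=200):
    """Fit loss ~ alpha * C**(-beta) + gamma by scanning gamma on a grid."""
    xs = [math.log(c) for c in compute]
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    floor = min(loss)
    best = None
    for k in range(grid):
        gamma = floor * k / grid  # candidates in [0, min(loss))
        # With gamma fixed, log(loss - gamma) is linear in log(C).
        ys = [math.log(l - gamma) for l in loss]
        my = sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        alpha, beta = math.exp(my - slope * mx), -slope
        sse = sum((alpha * c ** (-beta) + gamma - l) ** 2
                  for c, l in zip(compute, loss))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    return best[1:]

# Synthetic, noise-free losses generated from alpha=2, beta=0.5, gamma=0.5,
# with compute expressed in relative units.
compute = [1, 4, 16, 64, 256]
loss = [2.5, 1.5, 1.0, 0.75, 0.625]

alpha, beta, gamma = fit_power_law(compute, loss)
predicted = alpha * 1024 ** (-beta) + gamma  # extrapolate to 4x more compute
print(alpha, beta, gamma, predicted)
```

With real measurements the grid scan only gives a coarse estimate of $\gamma$; a nonlinear least-squares routine would be the usual refinement.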

Figures (6)

  • Figure 1: Performance-loss relationship across different model sizes (left) and learning rate schedules (middle). Performance-compute relationship for different clusters of the BBH samples (right).
  • Figure 2: The Clustering-On-Difficulty pipeline for downstream task performance scaling, in four stages: (a) Represent each task's difficulty with a task-wise pass-rate vector, cluster on this difficulty feature, and filter outliers. (b) Fit a cluster-wise performance-compute curve and classify clusters as extrapolatable, non-extrapolatable, or non-emergent. (c) Predict accuracy on the extrapolatable clusters. (d) Map the subset accuracy prediction to full evaluation set performance.
  • Figure 3: t-SNE visualization of different clustering methods: DBSCAN (left), MeanShift (middle), Improved-MeanShift (right). Each point represents an evaluation sample.
  • Figure 4: Performance-compute relationship for different prediction methods on eight evaluation sets.
  • Figure A1: Performance mapping with different interpolation methods on the BBH evaluation set. The cubic spline overfits and the cubic polynomial underfits. Quartic and quintic polynomials perform comparably, but the quartic has fewer parameters.
  • ...and 1 more figure
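
The Figure A1 caption favors a quartic polynomial for mapping subset performance to full-set performance. A minimal sketch of such a mapping follows, assuming paired (subset accuracy, full-set accuracy) measurements are available from existing checkpoints. The accuracy values are invented for illustration, the helper names are hypothetical, and the least-squares fit is solved through the normal equations in plain Python to stay dependency-free, which may differ from how the paper fits its mapping.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= f * a[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * coef[k] for k in range(i + 1, n))
        coef[i] = (b[i] - s) / a[i][i]
    return coef

def polyval(coef, x):
    """Evaluate the polynomial with Horner's rule."""
    acc = 0.0
    for c in reversed(coef):
        acc = acc * x + c
    return acc

# Made-up checkpoint measurements: accuracy on the predictable subset
# and on the full evaluation set.
subset_acc = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
full_acc = [0.08, 0.14, 0.21, 0.29, 0.37, 0.46, 0.55, 0.65]

mapping = polyfit(subset_acc, full_acc, degree=4)
# Map a predicted subset accuracy to a full-set estimate.
print(polyval(mapping, 0.75))
```

A polynomial mapping is only trustworthy inside the range of subset accuracies it was fitted on; far outside that range a quartic can swing sharply.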

Theorems & Definitions (6)

  • Theorem 1: Scaling Law for Downstream Task Performance
  • proof: Proof Sketch
  • Lemma 1: Arithmetic-geometric mean difference
  • proof
  • Theorem A1: Scaling Law for Downstream Task Performance
  • proof