Predictable Emergent Abilities of LLMs: Proxy Tasks Are All You Need

Bo-Wen Zhang, Yan Yan, Boxiang Yang, Yifei Xue, Guang Liu

TL;DR

This paper tackles the problem that scaling laws fail to predict emergent abilities in large language models. It proposes a two-stage proxy-task framework: first, select proxy tasks using relevance and robustness metrics derived from multiple models and small ensembles; second, predict the target task performance by aggregating proxy-task results with a weighted, transform-based scheme. The approach is validated on tool-use capabilities with a 42-task candidate pool and 17 model pairs, showing strong correlations between proxy-based predictions and actual T-Eval performance, and demonstrating robustness to training uncertainties and data choices. The findings suggest that early-stage proxy evaluations can reliably forecast complex abilities and inform training configuration decisions, offering a practical path to more efficient LLM development. The methodology emphasizes the integration of task relevance, robustness to data and initialization, and careful aggregation to yield actionable predictions for emergent capabilities.

Abstract

While scaling laws optimize training configurations for large language models (LLMs) through experiments on smaller or early-stage models, they fail to predict emergent abilities due to the absence of such capabilities in these models. To address this, we propose a method that predicts emergent abilities by leveraging proxy tasks. We begin by establishing relevance metrics between the target task and candidate tasks based on performance differences across multiple models. These candidate tasks are then validated for robustness with small model ensembles, leading to the selection of the most appropriate proxy tasks. The predicted performance on the target task is then derived by integrating the evaluation results of these proxies. In a case study on tool utilization capabilities, our method demonstrated a strong correlation between predicted and actual performance, confirming its effectiveness.
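The two stages described in the abstract (select proxies, then aggregate their results) can be illustrated with a minimal Python sketch. It is not the authors' implementation: this section does not give the paper's selection criteria, transform, or weights, so the ranking score (relevance times robustness), the relevance-proportional weights, the function names, and all numbers below are assumptions made purely for illustration.

```python
from typing import Dict, List


def select_proxies(relevance: Dict[str, float],
                   robustness: Dict[str, float],
                   k: int) -> List[str]:
    """Keep the k candidate tasks with the highest relevance * robustness.

    The product is an assumed way of combining the two criteria.
    """
    scored = {task: relevance[task] * robustness[task] for task in relevance}
    return sorted(scored, key=scored.get, reverse=True)[:k]


def predict_target(proxy_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted average of proxy-task scores, with weights normalised to sum to 1."""
    total = sum(weights[task] for task in proxy_scores)
    return sum(proxy_scores[task] * weights[task] for task in proxy_scores) / total


# Hypothetical candidate tasks and scores, for illustration only.
relevance = {"task_a": 0.9, "task_b": 0.4, "task_c": 0.8}
robustness = {"task_a": 0.7, "task_b": 0.9, "task_c": 0.9}

proxies = select_proxies(relevance, robustness, k=2)

# Hypothetical early-checkpoint results on the selected proxy tasks.
scores = {"task_a": 0.50, "task_c": 0.60}
prediction = predict_target(
    {task: scores[task] for task in proxies},
    {task: relevance[task] for task in proxies},
)
```

With these made-up inputs, `task_b` is dropped despite its high robustness because its relevance to the target is low, which is the intended interplay between the two criteria.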

Paper Structure

This paper contains 11 sections, 13 equations, 1 figure, 10 tables, 1 algorithm.

Figures (1)

  • Figure 1: Task relevance measured using Pearson, Spearman, and Kendall correlation metrics.
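For readers unfamiliar with the three correlation metrics named in the Figure 1 caption, the following dependency-free Python sketch computes each of them for a pair of score vectors. The input vectors are hypothetical values, not data from the paper, and the Kendall variant shown is tau-a (ties count as neither concordant nor discordant), which may differ from the variant the authors used.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(v: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def kendall(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of five models on a target task and one candidate task.
target = [0.10, 0.20, 0.35, 0.50, 0.70]
candidate = [0.30, 0.25, 0.45, 0.60, 0.80]

r_pearson = pearson(target, candidate)
r_spearman = spearman(target, candidate)
r_kendall = kendall(target, candidate)
```

In this toy data only the first two models swap order between the two tasks, so the rank-based metrics fall slightly below 1 while Pearson stays high because the magnitudes still track each other closely.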