U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Tung-Yu Wu; Pei-Yu Lo

TL;DR

This work analyzes how large language models exhibit emergent abilities by grouping questions by difficulty and observing distinct scaling trends: inverted-U for easy questions and U-shaped for hard ones. The authors introduce the Target-Conditioned (TC) Brier Score to measure performance continuously and define an emergence threshold $T$ where sharp improvements occur. They propose Slice-and-Sandwich, a pipeline that fits separate easy and hard scaling trends below $T$, averages them, and maps the forecast back to traditional accuracy metrics to predict post-threshold performance. Across multiple benchmarks, this approach forecasts emergent behavior more accurately than sigmoid-based baselines and offers an explainable framework for anticipating sharp performance increases. The method provides practical value for monitoring and forecasting LLM capabilities, with potential applications in deployment planning and safety assessment.

Abstract

Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where model performance stagnates at first and then improves sharply and unpredictably with scale beyond a threshold. In this work, we investigate the phenomenon by grouping questions based on difficulty level and provide a possible explanation for emergent abilities. Specifically, we observe U-shaped scaling for hard questions and inverted-U scaling followed by steady improvement for easy questions. The two scaling patterns initially offset each other, causing stagnant overall performance. The performance starts to soar when the scaling pattern of easy questions reverts from inverse to standard scaling, leading to emergent abilities. Based on this finding, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict the emergence threshold and model performance beyond the threshold. Our code is publicly available at https://github.com/tony10101105/ExpEmergence.
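The offsetting mechanism described in the abstract can be illustrated with a toy numerical sketch. The curves below are hypothetical quadratics chosen only so that the two groups cancel exactly below a threshold; they are not fitted to any data from the paper, and the names `hard_score`, `easy_score`, `overall_score`, and `THRESHOLD` are assumptions made for illustration. The horizontal axis stands in for log compute, and the vertical axis for a continuous score where higher is better.

```python
# Toy illustration (hypothetical curves, not the paper's data or fitting method).
# x plays the role of log compute; scores are on an arbitrary 0-1 scale.

THRESHOLD = 2.0  # assumed point where the easy group returns to standard scaling


def hard_score(x):
    """U-shaped trend: the score dips first, then recovers (minimum at x = 1)."""
    return 0.3 + 0.1 * (x - 1.0) ** 2


def easy_score(x):
    """Inverted-U trend below THRESHOLD, followed by steady improvement."""
    if x <= THRESHOLD:
        # Rises to a peak at x = 1, then declines (inverse scaling).
        return 0.5 - 0.1 * (x - 1.0) ** 2
    # Past the threshold the trend reverts to standard scaling.
    return 0.4 + 0.2 * (x - THRESHOLD)


def overall_score(x):
    """Average of the two groups, assuming equally sized groups."""
    return 0.5 * (hard_score(x) + easy_score(x))


if __name__ == "__main__":
    for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
        print(f"x={x:.1f}  hard={hard_score(x):.3f}  "
              f"easy={easy_score(x):.3f}  overall={overall_score(x):.3f}")
```

In this sketch the overall score stays flat up to the threshold, because the two quadratic terms cancel, and rises only once both groups improve together. That is the qualitative pattern the abstract attributes to emergent abilities.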

Paper Structure (42 sections, 8 equations, 27 figures, 4 tables)

Introduction
Scaling Trend by Difficulty Level: U-shape vs. Inverted-U
Terminology
Log Compute and Emergence Threshold
Continuous Performance Metrics
Grouping Questions by Difficulty Levels
Measuring Question Difficulty Level
Question Sorting and Grouping
U-Shaped and Inverted-U Scaling
Possible Explanation for U-shaped and Inverted-U Scaling
Scaling Trend of Easy Question Groups
Scaling Trend on Hard Question Group
Slice-and-Sandwich
Problem Formulation
Pipeline Overview
...and 27 more sections

Figures (27)

Figure 1: The accuracy, Target-Conditioned (TC) Brier Score, and U-shaped and inverted-U scaling on the MMLU benchmark (Hendrycks et al., 2021), evaluated using 56 LLMs. The section "Continuous Performance Metrics" provides details on the TC Brier Score, which captures granular changes in model performance. The appendix on implementation details describes the 56 LLMs.
Figure 2: The accuracy, TC Brier Score, and U-shaped and inverted-U scaling on the Persian-QA dataset in BIG-bench (Srivastava et al., 2023).
Figure 3: The accuracy, TC Brier Score, and U-shaped and inverted-U scaling on the arithmetic dataset in BIG-bench (Srivastava et al., 2023).
Figure 4: U-shaped and inverted-U scaling on 6 datasets exhibiting emergent abilities, with group number $G=3$. All tasks except MMLU are from BIG-bench. The strength of the U-shaped and inverted-U scaling trends varies across the 6 tasks.
Figure 5: Illustration of deep double descent (Nakkiran et al., 2021) on easy question groups and U-shaped scaling (Wei et al., 2023) on hard question groups under the TC Brier Score.
...and 22 more figures
