Emergent Abilities in Large Language Models: A Survey

Leonardo Berti, Flavio Giorgi, Gjergji Kasneci

TL;DR

This survey analyzes emergent abilities in large language models, tracing how scale, training dynamics, and prompting shape abrupt, task-specific capabilities whose existence and predictability are debated. It synthesizes evidence on metrics, loss dynamics, quantization, task complexity, and implicit representations, while documenting the rise of Large Reasoning Models and LLM-powered agents. The authors propose a taxonomy to organize origins, manifestations, and mitigation strategies, and highlight significant safety and governance implications of emergent, potentially deceptive or manipulative behaviors. They argue for more robust evaluation frameworks and targeted research into prediction, mitigation, and responsible deployment of increasingly capable AI systems.

Abstract

Large Language Models (LLMs) are leading a new technological revolution as one of the most promising research streams toward artificial general intelligence. The scaling of these models, accomplished by increasing the number of parameters and the magnitude of the training datasets, has been linked to various so-called emergent abilities that were previously unobserved. These emergent abilities, ranging from advanced reasoning and in-context learning to coding and problem-solving, have sparked an intense scientific debate: Are they truly emergent, or do they simply depend on external factors, such as training dynamics, the type of problems, or the chosen metric? What underlying mechanism causes them? Despite their transformative potential, emergent abilities remain poorly understood, leading to misconceptions about their definition, nature, predictability, and implications. In this work, we shed light on emergent abilities by conducting a comprehensive review of the phenomenon, addressing both its scientific underpinnings and real-world consequences. We first critically analyze existing definitions, exposing inconsistencies in conceptualizing emergent abilities. We then explore the conditions under which these abilities appear, evaluating the role of scaling laws, task complexity, pre-training loss, quantization, and prompting strategies. Our review extends beyond traditional LLMs and includes Large Reasoning Models (LRMs), which leverage reinforcement learning and inference-time search to amplify reasoning and self-reflection. However, emergence is not inherently positive. As AI systems gain autonomous reasoning capabilities, they also develop harmful behaviors, including deception, manipulation, and reward hacking. We highlight growing concerns about safety and governance, emphasizing the need for better evaluation frameworks and regulatory oversight.

Paper Structure

This paper contains 21 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of Emergent Abilities in Large Language Models
  • Figure 2: Reproduced from [schaeffer2024emergent] (standard deviations on the curves could not be reproduced due to missing data). Original caption: Claimed emergent abilities evaporate upon changing the metric. Left to Right: Mathematical Model, 2-Integer 2-Digit Multiplication Task, 2-Integer 4-Digit Addition Task. Top: When performance is measured by a nonlinear metric (e.g., Accuracy), the InstructGPT/GPT-3 [brown2020language] family's performance appears sharp and unpredictable on longer target lengths. Bottom: When performance is instead measured by a linear metric (e.g., Token Edit Distance), the family exhibits smooth, predictable performance improvements.
  • Figure 3: Reproduced from [schaeffer2024emergent] (standard deviations on the curves could not be reproduced due to missing data). Original caption: Claimed emergent abilities evaporate upon using better statistics. Left to Right: Mathematical Model, 2-Integer 2-Digit Multiplication Task, 2-Integer 4-Digit Addition Task. Based on the predictable effect Accuracy has on performance, measuring performance requires high resolution. Generating additional test data increases the resolution and reveals that even on Accuracy, the InstructGPT/GPT-3 [brown2020language] family's performance is above chance and improves in a smooth, continuous, predictable manner that qualitatively matches the mathematical model.
  • Figure 4: Reproduced from [wu2024u]. Original caption: U-Shaped and inverted-U scaling with MMLU’s questions clustered into 10 groups. Higher levels are harder questions.
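
The metric argument in the captions of Figures 2 and 3 can be illustrated with a small toy calculation. The sketch below is not the cited authors' code. It assumes a hypothetical power-law loss curve (the constants `1e8` and `0.3` are invented for illustration), treats tokens as independent, and uses expected per-token errors as a simple stand-in for Token Edit Distance. Under those assumptions, per-token accuracy improves smoothly with model size, while exact-match accuracy on a multi-token target stays near zero and then rises sharply.

```python
import math

# Assumed toy scaling curve: per-token cross-entropy loss falls as a power law
# in the parameter count. The constants are illustrative, not fitted to data.
def per_token_loss(n_params: float) -> float:
    return (n_params / 1e8) ** -0.3

def per_token_accuracy(n_params: float) -> float:
    # Probability of emitting the correct token, taken as exp(-loss).
    return math.exp(-per_token_loss(n_params))

def exact_match_accuracy(p_token: float, target_len: int) -> float:
    # Nonlinear metric: every token must be right (independence assumed).
    return p_token ** target_len

def expected_token_errors(p_token: float, target_len: int) -> float:
    # Linear metric: expected number of wrong tokens, a stand-in for
    # Token Edit Distance in this simplified setting.
    return target_len * (1.0 - p_token)

if __name__ == "__main__":
    target_len = 10  # hypothetical target length in tokens
    for n in (1e8, 1e9, 1e10, 1e11, 1e12):
        p = per_token_accuracy(n)
        print(
            f"N={n:.0e}  per-token acc={p:.3f}  "
            f"exact match={exact_match_accuracy(p, target_len):.4f}  "
            f"expected token errors={expected_token_errors(p, target_len):.2f}"
        )
```

Running the script prints one row per model size. The linear metric changes gradually across the whole range, whereas the exact-match column is the one that appears to "emerge", which is the qualitative pattern the captions describe.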