Small Language Models: Survey, Measurements, and Insights

Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu

TL;DR

The paper surveys small language models (100M–5B parameters), focusing on open-weight, decoder-only architectures. It reviews 70 models along three axes (architectures, training datasets, and training algorithms), evaluates their capabilities (commonsense reasoning, mathematics, in-context learning, and long-context retrieval), and benchmarks their on-device latency and memory on edge hardware, highlighting the impact of quantization and hardware. Key findings: architectural innovations in SLMs are modest, while data quality and curated pretraining datasets drive performance; in-context learning and long-context capabilities both improve as model size grows. The authors propose directions spanning hardware–algorithm co-design, synthetic data pipelines, deployment-aware scaling, continual on-device learning, device–cloud collaboration, fair benchmarking, and sparsity exploration to advance SLM research and deployment.

Abstract

Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 70 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, mathematics, in-context learning, and long context. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.

Paper Structure (25 sections, 24 figures, 3 tables)

Figures (24)

  • Figure 1: An overview of SLMs. * indicates models that are not open-sourced and are therefore not benchmarked.
  • Figure 2: The architecture.
  • Figure 3: Architecture distribution.
  • Figure 5: The usage frequency of each open-source pre-training dataset from 2022 to 2024.
  • Figure 6: The relationship between Training Tokens and Parameters.
  • ...and 19 more figures

Theorems & Definitions (11)

  • remark 1
  • remark 2
  • remark 3
  • remark 4
  • remark 5
  • remark 6
  • remark 7
  • remark 8
  • remark 9
  • remark 10
  • ...and 1 more