Super Tiny Language Models

Dylan Hillier, Leon Guertler, Cheston Tan, Palaash Agrawal, Chen Ruirui, Bobby Cheng

TL;DR

Large language models (LLMs) impose substantial computational and energy costs, limiting accessibility and experimentation. The paper proposes Super Tiny Language Models (STLMs) in the 10–100M parameter range, leveraging techniques such as tokenizer-free designs via byte-level pooling, weight tying, and data-efficient training to sustain competitive performance. It outlines a public PyTorch-based research pipeline, benchmark plans, and a suite of subprojects (tokenizer-free models, self-play, alternative objectives) to systematically reduce parameters while preserving utility. If realized, STLMs could democratize NLP research, reduce environmental impact, and enable rapid, on-demand experimentation on commodity hardware.

Abstract

The rapid advancement of large language models (LLMs) has led to significant improvements in natural language processing but also poses challenges due to their high computational and energy demands. This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs), which aim to deliver high performance with significantly reduced parameter counts. We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies. These methods aim to significantly reduce the parameter count compared to traditional models; in future work, we aim to build on them in a way that maintains and improves upon the performance of base transformer models. This series of papers will explore various subproblems, including tokenizer-free models, self-play based training, and alternative training objectives. We will target models with 10M, 50M, and 100M parameters. Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
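
To give a rough sense of why byte-level vocabularies and weight tying matter at this scale, the sketch below counts only the parameters in the input embedding table and the output projection. It is an illustrative back-of-the-envelope calculation, not the paper's configuration: the hidden width `D_MODEL` and the helper `embedding_params` are assumptions introduced here, the 50,257-entry vocabulary is the GPT-2 BPE size used purely as a reference point, and the pooling mechanism mentioned in the abstract is not modeled.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection (no bias).

    With weight tying, the output projection reuses the embedding table,
    so the table is counted only once.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table


D_MODEL = 512            # assumed hidden width, not taken from the paper
SUBWORD_VOCAB = 50_257   # GPT-2 BPE vocabulary size, used as a reference point
BYTE_VOCAB = 256         # one entry per possible byte value

untied_subword = embedding_params(SUBWORD_VOCAB, D_MODEL, tied=False)
tied_subword = embedding_params(SUBWORD_VOCAB, D_MODEL, tied=True)
tied_byte = embedding_params(BYTE_VOCAB, D_MODEL, tied=True)

print(f"subword, untied: {untied_subword:,}")
print(f"subword, tied:   {tied_subword:,}")
print(f"byte-level, tied: {tied_byte:,}")
```

Under these assumed sizes, an untied subword vocabulary alone would consume roughly half of a 100M-parameter budget, while a tied byte-level table is negligible. The trade-off is that byte sequences are much longer than subword sequences, which is the problem a pooling step is meant to address.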

Paper Structure (37 sections, 6 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Perplexity during training for the baseline models
  • Figure 2: Diagram demonstrating flow of information through transformer components during training