TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

Ke Yang; Volodymyr Kindratenko; ChengXiang Zhai

TL;DR

This work tackles the high cost of training language models by proposing a simplified language environment and a leaner data pipeline. It introduces TinyHelen and a suite of leaner datasets for training and evaluating tiny LMs with a vocabulary limited to 2,000 tokens: Leaner-Pretrain (71M tokens), Leaner-Instruct (7M), Leaner-Glue, and Leaner-Eval (1,594 questions). The study demonstrates that leaner pre-training improves learning efficiency and instruction-following for tiny models. It also shows that curriculum-learning strategies, tested on tiny proxy models, can further reduce the number of pre-training steps, suggesting possible transfer to larger models. The work provides a cost-effective testbed for self-evolving agents and shares its code and data on GitHub.

Abstract

Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.

Paper Structure (129 sections, 13 equations, 2 figures, 9 tables, 3 algorithms)

Introduction
Related Work
Text-based Self-evolving Agents with Curriculum Learning
Tiny Datasets for Tiny Language Models
Model Architecture Comparison and Curriculum Learning Strategies for Pre-training
Dataset Curation and Statistics
Leaner-Pretrain and Leaner-Instruct
Data Sources
Language Simplification
Leaner-Glue
Leaner-Eval
Experiments
Exp1: Comparing Model Architectures with the Leaner Dataset Suite
Models
Benchmark
...and 114 more sections

Figures (2)

Figure 1: Twin samples of the original and the Leaner dataset.
Figure 2: The average performance score on downstream tasks of proxy models pre-trained with both vanilla and varying curriculum learning strategies. Models are tested and their results plotted at 500-step intervals. The figures are based on the data in Table \ref{table:exp3-curriculum-learning-comparison}.
