TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment
Ke Yang, Volodymyr Kindratenko, ChengXiang Zhai
TL;DR
This work tackles the high cost of training language models by proposing a simplified language environment and a leaner data pipeline. It introduces TinyHelen and a suite of leaner datasets: Leaner-Pretrain ($71\text{M}$ tokens), Leaner-Instruct ($7\text{M}$), Leaner-Glue, and Leaner-Eval ($1{,}594$ questions) to train and evaluate tiny LMs with vocabulary limited to $2{,}000$ tokens. The study demonstrates that leaner pre-training improves learning efficiency and instruction-following for tiny models, and shows curriculum-learning strategies can further reduce pre-training steps with tiny proxies, suggesting possible transfer to larger models. The work provides a cost-effective testbed for self-evolving agents, and shares code and data at GitHub.
Abstract
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, which makes failed runs hard to afford. Simplified language environments serve as primordial training and testing grounds: they retain essential commonsense and communication skills in a more digestible form, potentially enhancing the learning efficiency of LMs and thus reducing the model size and data volume required for effective training and evaluation. In these simplified language environments, strategies that work for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline that refines text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, and code). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.
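To make the idea of a minimized vocabulary concrete, here is a minimal sketch of one way to check how well a text fits a restricted word list. It is not the authors' pipeline, which uses large LMs to rewrite text. The helper names (`tokenize`, `build_vocab`, `oov_rate`) and the word-level regex tokenizer are assumptions made only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word-level tokenization; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocab(corpus, max_size):
    """Return the max_size most frequent word types across the corpus."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return {word for word, _ in counts.most_common(max_size)}


def oov_rate(text, vocab):
    """Fraction of tokens in text that fall outside the allowed vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)
```

Under these assumptions, a text rewritten into a simpler vocabulary should show a lower `oov_rate` against the restricted word list than its original version. The measure says nothing about whether genre-specific patterns were preserved.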
