1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data

Calvin Tan, Jerome Wang

TL;DR

This work argues that data quality, not sheer quantity, can dramatically reduce LLM pre-training time and resources. By curating a 57B-token, textbook-like corpus and coupling it with architectural and alignment optimizations, the authors train a 1.56B-parameter model (1.5-Pints) in 9 days and achieve strong instruction-following performance on MT-Bench with far less data than peers. Key innovations include the use of a modified Mistral tokenizer, padding and chat-template tokens, Grouped Query Attention, and DPO-based alignment, complemented by targeted fine-tuning on diverse instruction datasets. The results suggest practical pathways toward more accessible, environmentally friendly LLM development, with open-source resources to foster further research and reproducibility.
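As an illustration of the DPO-based alignment mentioned above, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not the authors' training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for this example. The inputs are the summed log-probabilities that the policy and a frozen reference model assign to the preferred ("chosen") and dispreferred ("rejected") responses.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration, not taken from the report
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss sits at log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy shifts probability toward the chosen response: the loss drops.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is why no separate reward model is needed.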

Abstract

This paper presents a compute-efficient approach to pre-training a language model, "1.5-Pints", in only 9 days, while outperforming state-of-the-art models as an instruction-following assistant. Based on MT-Bench (a benchmark that emulates human judgments), 1.5-Pints outperforms Apple's OpenELM and Microsoft's Phi. This is achieved with a carefully curated pre-training dataset of 57 billion tokens, assembled using a mix of automated workflows and manual human review. The dataset selection prioritizes content that is expository and "textbook-like" to aid the model in reasoning and logical deduction, contributing to its overall ability as a strong and versatile AI model. For the model architecture, we employed a modified Mistral tokenizer alongside a Llama-2 architecture for wider compatibility. For training, we adopted the methodologies used by StableLM, TinyLlama, and Huggingface Zephyr. 1.5-Pints demonstrates that focusing on data quality over quantity in LLM training can significantly reduce the training time and resources required. We believe this approach will not only make pre-training more accessible but also reduce our carbon footprint. Our findings and resources from this research are open-sourced to facilitate further advancements in the field. The 1.5-Pints model is available in two versions, with 2K and 16K context windows.

Paper Structure (46 sections, 3 equations, 3 figures, 15 tables)

Figures (3)

  • Figure 1: Growth in Pre-Training Corpus
  • Figure 2: Pre-Training Corpus Comparison
  • Figure 3: Impact of training modality on performance