Training Dynamics of a 1.7B LLaMa Model: A Data-Efficient Approach

Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang

TL;DR

The paper examines training dynamics and practicalities in pretraining a 1.7B LLaMa-based model (DMaS-LLaMa-Lite) on a carefully curated ~20B token corpus, emphasizing data quality, optimizer continuity, and hardware transitions. It combines a data-efficient pretraining approach with post-training instruction tuning using LoRA adapters, demonstrating that high-quality data can outperform larger token counts from less curated sources. Key findings include the necessity of restoring optimizer states to avoid loss spikes, the impact of hardware transitions on stability, and substantial qualitative gains from instruction tuning, even with a relatively small fine-tuning budget. The work provides actionable guidance and artifacts for reproducibility, offering a practical blueprint for researchers and practitioners aiming to optimize pretraining and refinement of LLMs under realistic resource constraints.

Abstract

Pretraining large language models is a complex endeavor influenced by multiple factors, including model architecture, data quality, training continuity, and hardware constraints. In this paper, we share insights gained from the experience of training DMaS-LLaMa-Lite, a fully open-source, 1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of carefully curated data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. Beyond pretraining, we extend our analysis to include a post-training phase focused on instruction tuning, where the model was refined to produce more contextually appropriate, user-aligned responses. We highlight practical considerations such as the importance of restoring optimizer states when resuming from checkpoints, and the impact of hardware changes on training stability and throughput. While qualitative evaluation provides an intuitive understanding of model improvements, our analysis extends to various performance benchmarks, demonstrating how high-quality data and thoughtful scaling enable competitive results with significantly fewer training tokens. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies. The training script is available on GitHub at https://github.com/McGill-DMaS/DMaS-LLaMa-Lite-Training-Code. The model checkpoints are available on Hugging Face at https://huggingface.co/collections/McGill-DMaS/dmas-llama-lite-6761d97ba903f82341954ceb.
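
The point about restoring optimizer states can be illustrated with a toy example. The sketch below is not the authors' training code: it is a scalar Adam written from scratch on a made-up quadratic objective, and every name in it (`Adam`, `grad`, `train`, `ckpt`) is hypothetical. It only shows that a run resumed with the saved moment estimates and step count retraces the uninterrupted run, while a run resumed with a fresh optimizer does not.

```python
import math


class Adam:
    """Minimal scalar Adam, for illustration only (hypothetical helper)."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment (mean) estimate
        self.v = 0.0  # second-moment (uncentered variance) estimate
        self.t = 0    # step counter used for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

    def state_dict(self):
        return {"m": self.m, "v": self.v, "t": self.t}

    def load_state_dict(self, state):
        self.m, self.v, self.t = state["m"], state["v"], state["t"]


def grad(w):
    # Gradient of the toy objective (w - 3)^2; an assumption for this sketch.
    return 2.0 * (w - 3.0)


def train(w, opt, steps):
    for _ in range(steps):
        w = opt.step(w, grad(w))
    return w


# Reference: 20 uninterrupted steps.
w_ref = train(0.0, Adam(), 20)

# Interrupted run: stop after 10 steps and checkpoint weights AND optimizer state.
opt_a = Adam()
w_ckpt = train(0.0, opt_a, 10)
ckpt = {"w": w_ckpt, "opt": opt_a.state_dict()}

# Resume with the optimizer state restored.
opt_b = Adam()
opt_b.load_state_dict(ckpt["opt"])
w_restored = train(ckpt["w"], opt_b, 10)

# Resume with a fresh optimizer (moments and step counter reset to zero).
w_fresh = train(ckpt["w"], Adam(), 10)

# Size of the very first update a fresh optimizer makes after the resume.
first_fresh_update = Adam().step(ckpt["w"], grad(ckpt["w"])) - ckpt["w"]

print("restored matches reference:", math.isclose(w_restored, w_ref, abs_tol=1e-12))
print("fresh differs from reference by:", abs(w_fresh - w_ref))
print("first update after fresh restart:", first_fresh_update)
```

With zeroed moments, the first bias-corrected Adam update has magnitude close to the full learning rate regardless of gradient scale, which is one plausible mechanism for the kind of post-resume loss spike the abstract describes. In a real framework the same idea amounts to saving and reloading the optimizer's state alongside the model weights.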

Paper Structure

This paper contains 17 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Training logs visualizing training and validation loss, HellaSwag accuracy, learning rate decay, norm behavior, and tokens processed per second over the course of 40,000+ steps.
  • Figure 2: Performance comparison of DMaS-LLaMa-Lite checkpoints and TinyLLaMa (2T) across various benchmarks. Solid lines represent DMaS-LLaMa-Lite performance at different training steps, while horizontal dotted lines indicate TinyLLaMa 2T results.