H2O-Danube-1.8B Technical Report

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

TL;DR

H2O-Danube is a family of open-source 1.8B-parameter decoder LLMs. The first model is trained on 1T tokens, and an enhanced iteration trained on 2T tokens (Danube2) achieves state-of-the-art performance among open models below 2B parameters. The work combines architectural choices inspired by Llama 2 and Mistral with data-stage training, FP8 acceleration, and an SFT + DPO dialogue-tuning pipeline to produce competitive base and chat models. The models are released under Apache 2.0, which permits commercial use and community fine-tuning, and they perform strongly on commonsense reasoning, world knowledge, and reading comprehension benchmarks as well as on the Open LLM Leaderboard. This open, permissive release aims to democratize access to capable LLMs that can run on consumer hardware and be further improved by the research and developer community.

Abstract

We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incrementally improved H2O-Danube2-1.8B, trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on the Open LLM Leaderboard among all models below the 2B parameter range. The models follow core principles of Llama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under the Apache 2.0 license, further democratizing LLMs economically to a wider audience.

Paper Structure (11 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Training logs. Training (top left) and validation (top right) cross-entropy loss, learning rate schedule (bottom left), and sequence length (bottom right). The x-axis shows the number of tokens the model has been trained on up to each step.
  • Figure 2: Data stages for Danube2. The model is trained over three stages with different data mixes. Web data makes up 84.5% of the first stage, decreasing to 72.8% in the second stage and to 55.5% in the third. The first two stages account for the majority of the tokens, 1T and 0.95T respectively, while the third stage comprises 0.05T tokens.
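
The stage sizes and web-data shares in the Figure 2 caption can be combined into a quick sanity check. The following is a minimal sketch, not part of the paper: the names `STAGES` and `summarize` are hypothetical, and it assumes each stage's web share applies uniformly to all tokens in that stage. The aggregate web share it computes is a derived quantity and is not reported in the source.

```python
# Stage sizes (in trillions of tokens) and web-data shares,
# taken from the Figure 2 caption. Names here are illustrative only.
STAGES = [
    ("stage 1", 1.00, 0.845),
    ("stage 2", 0.95, 0.728),
    ("stage 3", 0.05, 0.555),
]


def summarize(stages):
    """Return (total tokens, web tokens, overall web share) across stages.

    Assumes the per-stage web share applies uniformly to that stage's tokens.
    """
    total = sum(tokens for _, tokens, _ in stages)
    web = sum(tokens * share for _, tokens, share in stages)
    return total, web, web / total


total_tokens, web_tokens, web_share = summarize(STAGES)

for name, tokens, share in STAGES:
    print(f"{name}: {tokens:.2f}T tokens, {share:.1%} web data")
print(f"total: {total_tokens:.2f}T tokens, overall web share {web_share:.1%}")
```

The three stages sum to 2T tokens, matching the "additional 2T tokens" stated in the abstract for H2O-Danube2-1.8B. Under the uniformity assumption above, the token-weighted web share across all stages works out to roughly 78%.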