Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

TL;DR

This work tackles the underrepresentation of Hebrew in large language models by adapting a pre-trained open-weight model (Mistral) through tokenizer extension, embedding distillation, and instruction tuning. The authors introduce two models, DictaLM2.0 and DictaLM2.0-Instruct, trained on approximately 200 billion tokens of Hebrew and English text, together with a Hebrew benchmark suite covering question answering, sentiment analysis, the Winograd Schema Challenge, translation, and summarization. Key contributions include the adaptation pipeline, the open Hebrew benchmark, and an open Hebrew LLM leaderboard that allows like-for-like comparison across models. The authors report state-of-the-art performance on several Hebrew NLP tasks and present the pipeline as a framework that can be reused to adapt LLMs to other non-English, low-resource languages.

Abstract

Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.

Paper Structure (20 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: This figure illustrates the relationship between the number of added tokens and the compression ratio in Hebrew (tokens per word).
  • Figure 2: This graph depicts the changes applied to the regular causal mask (dark gray) for our document-attention causal mask (light purple), ensuring tokens from separate documents are masked to restrict cross-document attention.
  • Figure 3: This graph depicts the loss value during the continuous pre-training stage.
  • Figure 4: Comparison of evaluation results between our model and other base models on Hebrew few-shot tasks.
  • Figure 5: Human evaluation results from a blind test comparing translations from our model and Google Translate.
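The document-attention causal mask described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name document_causal_mask, the doc_ids input, and the use of NumPy are assumptions made for illustration. The sketch assumes several documents are packed into one training sequence and that each token carries the index of the document it came from. A position may then attend only to earlier positions in the same document.

```python
import numpy as np


def document_causal_mask(doc_ids):
    """Build a boolean (T, T) attention mask for a packed sequence.

    mask[i, j] is True when query position i may attend to key position j,
    i.e. when j is not in the future (j <= i) and both positions belong to
    the same document. `doc_ids` is a hypothetical per-token document index.
    """
    doc_ids = np.asarray(doc_ids)
    t = doc_ids.shape[0]
    # Regular causal mask: lower triangle, including the diagonal.
    causal = np.tril(np.ones((t, t), dtype=bool))
    # Block structure: True only where both tokens share a document index.
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    # Keep a causal entry only if it stays inside one document.
    return causal & same_doc


# Hypothetical packed sequence: three tokens from document 0, two from document 1.
mask = document_causal_mask([0, 0, 0, 1, 1])
print(mask.astype(int))
```

In this example the fourth token (the first token of the second document) can attend only to itself, whereas a regular causal mask would also let it see the three tokens of the first document.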