GEITje 7B Ultra: A Conversational Model for Dutch

Bram Vanroy

TL;DR

This work addresses the limited Dutch-language capabilities of open-source LLMs by extending GEITje through supervised finetuning (SFT) on synthetic Dutch conversations, followed by a preference-alignment phase. It introduces two SFT datasets (Ultra Chat 200k Dutch and No Robots Dutch) and two cleaned preference datasets (Ultra Feedback Dutch and Orca DPO Pairs Dutch) to train and align the model, with extensive filtering to ensure Dutch content. The two-stage training pipeline, SFT followed by Direct Preference Optimization (DPO), produces GEITje 7B Ultra SFT and GEITje 7B Ultra, an alignment-focused Dutch LLM released openly alongside the datasets. While GPT-4-driven baselines still lead on benchmark tasks, Ultra demonstrates competitive Dutch conversational fluency and alignment. This is of practical value to Dutch-speaking users and researchers, although the work acknowledges that benchmarks are a limited measure of real-world usefulness.
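
To make the preference-alignment stage more concrete, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not taken from the paper's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration. The inputs are the summed log-probabilities that the trained policy and the frozen reference model (here, the SFT model) assign to the preferred ("chosen") and dispreferred ("rejected") responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence log-probabilities (floats).
    beta is a hypothetical default, not the value used in the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy favours the chosen response.
    margin = beta * (chosen_log_ratio - rejected_log_ratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy is identical to the reference model, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability towards the chosen response and away from the rejected one, relative to the reference.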

Abstract

Language models have rapidly evolved, predominantly focusing on English while often neglecting extensive pretraining in other languages. This focus has necessitated initiatives to adapt powerful, English-centric models to other linguistic contexts through finetuning. For Dutch, one such recent endeavour is "GEITje", a model originally derived from the English-based Mistral 7B. Building on this foundational work, the current research extends the capabilities of GEITje through supervised finetuning on newly created, high-quality synthetic conversational datasets, followed by an additional preference alignment procedure on a synthetic feedback dataset. Both the developed models and the created datasets are openly available.

Paper Structure

This paper contains 17 sections, 1 table.