GEITje 7B Ultra: A Conversational Model for Dutch
Bram Vanroy
TL;DR
This work addresses the scarcity of Dutch-language capabilities in open-source LLMs by extending GEITje through supervised finetuning (SFT) on synthetic Dutch conversations, followed by a preference-alignment phase. It introduces two SFT datasets (Ultra Chat 200k Dutch and No Robots Dutch) and two cleaned preference datasets (Ultra Feedback Dutch and Orca DPO Pairs Dutch) to train and align the model, with extensive filtering to ensure Dutch content. The two-stage training pipeline, SFT followed by Direct Preference Optimization (DPO), produces GEITje 7B Ultra SFT and the aligned GEITje 7B Ultra, both released openly alongside the datasets. While GPT-4-driven baselines still lead on benchmark tasks, Ultra demonstrates competitive Dutch conversational fluency and alignment. The work is of practical value to Dutch-speaking users and researchers, and it acknowledges that benchmarks are limited in how well they measure real-world usefulness.
Abstract
Language models have evolved rapidly, predominantly focusing on English while often neglecting extensive pretraining in other languages. This has necessitated initiatives to adapt powerful, English-centric models to other linguistic contexts through finetuning. For Dutch, one such recent endeavour is "GEITje", a model originally derived from the English-based Mistral 7B. Building on this foundational work, the current research extends the capabilities of GEITje through supervised finetuning on newly created, high-quality synthetic conversational datasets, followed by an additional preference alignment procedure on a synthetic feedback dataset. Both the developed models and the created datasets are openly available.
