Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov
TL;DR
This work tackles the problem of producing fluent, instruction-following outputs in lower-resource languages that lack native instruction datasets. It introduces a fluency-aware post-training pipeline that uses on-policy reinforcement learning with AI feedback from an LLM judge, so the model is never exposed to translated data. In a case study on Norwegian Bokmål, on-policy alignment yields substantially more fluent outputs than baselines trained with supervised finetuning on translated data, and fluency can be bootstrapped from disfluent judges as long as they understand the target language. Extensive ablations, automatic fluency metrics, and human evaluations support the claim that avoiding translated data and relying on on-policy feedback improves fluency and robustness in lower-resource language settings.
Abstract
We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial: the on-policy method outperforms both alternatives without relying on any hard-to-obtain data.
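To make the term "on-policy" concrete, the following is a minimal toy sketch, not the method used in this work. It assumes a three-way categorical policy in place of a language model, a hard-coded `judge_score` function in place of an LLM judge, and a plain REINFORCE update with a running baseline; the candidate strings, scores, and hyperparameters are all hypothetical. The point it illustrates is only that every training signal comes from responses the current policy sampled itself, never from external (for example, translated) target text.

```python
import math
import random

# Hypothetical candidate responses; a real policy would be a language model.
CANDIDATES = ["fluent answer", "translationese answer", "off-topic answer"]


def judge_score(response: str) -> float:
    """Stand-in for an LLM judge. The scores are invented for illustration."""
    scores = {
        "fluent answer": 1.0,
        "translationese answer": 0.4,
        "off-topic answer": 0.0,
    }
    return scores[response]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_on_policy(steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-mean baseline on the toy policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy step: the response is sampled from the current policy.
        idx = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = judge_score(CANDIDATES[idx])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log p(idx) w.r.t. the logits is one_hot(idx) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_on_policy()
    for text, p in zip(CANDIDATES, final_probs):
        print(f"{text}: {p:.3f}")
```

In this toy setting the policy shifts probability mass toward whichever of its own samples the judge scores highest. Supervised finetuning, by contrast, would push the policy toward fixed target strings regardless of what the policy itself would generate.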
