Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov

TL;DR

This work tackles the problem of producing fluent, instruction-following language outputs for lower-resource languages without native instruction datasets. It introduces a fluency-aware post-training pipeline that uses on-policy reinforcement learning with AI-based feedback from an LLM judge, avoiding exposure to translated data. Through a Norwegian Bokmål case study, it demonstrates that on-policy alignment yields substantially more fluent outputs than translated-SFT baselines, and that fluency can be bootstrapped from disfluent judges as long as they understand the target language. The study also provides extensive ablations, automatic fluency metrics, and human evaluations to support the claim that avoiding translated data and relying on on-policy feedback enhances fluency and robustness for low-resource language settings.

Abstract

We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.

Paper Structure

This paper contains 59 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Left: Reinforcement-learning cycle. This diagram demonstrates the sequential nature of online RL training: each training step starts by sampling new responses from the policy model, followed by sampling response judgments from the reward model, and then updating the weights of the policy model based on the sampled responses and rewards. Right: Parallelization. Breaking the cycle and postponing the update of the sampled policy allows for running all three models at the same time (vertically aligned blocks are run concurrently on different GPU nodes).
  • Figure 2: Genealogy of the compared models. The three models compared in the main fluency test (highlighted in bold boxes) all originate from a single base model -- Mistral Nemo 12B (left).
  • Figure 3: Fluency and NLG scores throughout training. We measure the performance score every 25 training steps for the reinforcement learning (in blue) and every epoch for the SFT training.
  • Figure 4: Screenshot of the annotation tool. Each annotator is provided with a randomized sequence of response pairs in randomized order.
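
The scheduling idea in the Figure 1 caption can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the assumption that the three stages take equal time, and the fixed two-step delay between sampling a batch and applying its update are all hypothetical choices made for illustration. The sketch only compares how many time slots the strictly sequential cycle needs with a pipelined variant in which sampling, judging, and updating run concurrently on different batches.

```python
from typing import List, Tuple

# Hypothetical stage names, mirroring the three steps in the Figure 1 caption.
STAGES = ("sample", "judge", "update")

Schedule = List[Tuple[int, str, int]]  # (time_slot, stage, batch_index)


def sequential_schedule(num_batches: int) -> Schedule:
    """Strict cycle: each batch is sampled, judged, and used for an update
    before the next batch is sampled."""
    schedule: Schedule = []
    slot = 0
    for batch in range(num_batches):
        for stage in STAGES:
            schedule.append((slot, stage, batch))
            slot += 1
    return schedule


def pipelined_schedule(num_batches: int) -> Schedule:
    """Broken cycle (assumed form): in slot t the policy samples batch t,
    the judge scores batch t-1, and the trainer updates on batch t-2.
    Sampling therefore uses slightly stale policy weights."""
    schedule: Schedule = []
    for slot in range(num_batches + len(STAGES) - 1):
        for offset, stage in enumerate(STAGES):
            batch = slot - offset
            if 0 <= batch < num_batches:
                schedule.append((slot, stage, batch))
    return schedule


def num_slots(schedule: Schedule) -> int:
    """Total number of time slots a schedule occupies."""
    return max(slot for slot, _, _ in schedule) + 1


if __name__ == "__main__":
    print(num_slots(sequential_schedule(4)))  # 12
    print(num_slots(pipelined_schedule(4)))   # 6
```

Under these assumptions, n batches take 3n slots sequentially but n + 2 slots when pipelined. The cost is that each batch is sampled from a policy that has not yet received the most recent updates.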