RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

Dongyub Jude Lee, Zhenyi Ye, Pengcheng He

TL;DR

This work addresses the reliance of MT optimization on curated preference triplets by introducing RLfR, a framework that uses continuous feedback from a frozen teacher model to iteratively refine the actor's translations. RLfR combines a generate–refine–reinforce loop with a composite reward that blends negative edit distance with semantic adequacy measured by COMET, yielding model-aware, data-efficient learning without static references. The actor is initialized on data distilled from GPT-4o-mini and then refined online using multi-sample teacher feedback, with training stabilized by batch-normalised REINFORCE++ and KL regularization. On FLORES-200, RLfR consistently outperforms MT-SFT, DPO, and fixed-reference RL on COMET and M-ETA across multiple language directions and model sizes, which supports its use for scalable, high-quality translation.

Abstract

Preference-learning methods for machine translation (MT)--such as Direct Preference Optimization (DPO)--have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals--(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy--the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
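The two reward signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The length normalization, the mixing weight `alpha`, and the `semantic_score` callable (a stand-in for a COMET scorer, which in practice also takes the source sentence) are all assumptions introduced for illustration.

```python
from typing import Callable


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def composite_reward(
    hypothesis: str,
    refinement: str,
    semantic_score: Callable[[str, str], float],
    alpha: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Blend negative edit distance with a semantic adequacy score.

    The edit distance is divided by the longer string's length (an assumption)
    so that the lexical term lies in [-1, 0].
    """
    max_len = max(len(hypothesis), len(refinement), 1)
    neg_edit = -edit_distance(hypothesis, refinement) / max_len
    return alpha * neg_edit + (1.0 - alpha) * semantic_score(hypothesis, refinement)
```

Under these assumptions, a hypothesis identical to the teacher's refinement receives a lexical term of zero, so its reward is determined by the semantic score alone. Larger edits push the reward down in proportion to how much of the hypothesis the teacher had to change.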

Paper Structure

This paper contains 24 sections, 14 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: RLfR follows an incremental generate--refine--reinforce cycle. A policy actor is first initialised, then refined online by a frozen teacher model, and finally updated with a critic-free REINFORCE++ objective that uses multi-sampling.
  • Figure 2: COMET scores on the FLORES test set using the LLaMA-3.1 8B model, evaluated with different amounts of distilled SFT training data per language pair.
  • Figure 3: Comparison of translations for “Jia Yingchun.” SFT introduces semantic distortion, whereas RLfR produces a phonetically faithful and contextually valid variant.
  • Figure 4: Stepwise dynamics of teacher-guided refinement for the ZLM-2.3B model trained with a mixture of supervised and refinement-based updates. Reward rises monotonically, response length stays within a narrow band, and COMET-22 peaks at step 416, aligning with the reward plateau, before marginally tapering off.
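The critic-free update with multi-sampling described in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions rather than the paper's algorithm: rewards are normalized with the mean and standard deviation of the whole batch, and the KL term is approximated by a single-sample log-probability ratio scaled by a hypothetical coefficient `beta`. The function names are invented for this example.

```python
import math
from typing import List


def kl_penalized_reward(
    reward: float,
    logp_actor: float,
    logp_ref: float,
    beta: float = 0.01,  # hypothetical KL coefficient
) -> float:
    """Subtract a single-sample KL estimate (log-prob ratio) from the reward."""
    return reward - beta * (logp_actor - logp_ref)


def batch_normalized_advantages(
    rewards: List[List[float]],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Normalize rewards across the whole batch instead of using a learned critic.

    `rewards[i][k]` is the reward of the k-th sampled translation for source i.
    """
    flat = [r for group in rewards for r in group]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    return [[(r - mean) / (std + eps) for r in group] for group in rewards]


# Two source sentences, two sampled translations each.
advantages = batch_normalized_advantages([[1.0, 3.0], [2.0, 2.0]])
```

In a policy-gradient step, each advantage would weight the log-probability of its sampled translation. Samples scoring above the batch mean are reinforced and those below it are discouraged, with no value network needed.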