RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
TL;DR
This work tackles the reliance on curated triplets for MT optimization by introducing RLfR, a framework that uses continuous feedback from a frozen teacher to iteratively refine actor translations. RLfR combines a generate–refine–reinforce loop with a composite reward that blends negative edit distance with semantic adequacy from COMET, yielding model-aware, data-efficient learning without static references. The actor is initialized on GPT-4o-mini distilled data and then refined online via multi-sample teacher feedback, with training stabilized by batch-normalised REINFORCE++ and KL regularization. Evaluations on FLORES-200 show that RLfR consistently outperforms MT-SFT, DPO, and fixed-reference RL on COMET and M-ETA across multiple language directions and model sizes, supporting its practical value for scalable, high-quality translation.
Abstract
Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals, (i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy, the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
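As an illustration of the two-signal reward described above, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, a length-normalized token-level edit distance, and a simple convex weighting `alpha`; none of these details are specified in the abstract. The `adequacy_fn` argument is a hypothetical placeholder for a COMET-style scorer, and the names `levenshtein` and `composite_reward` are chosen here for illustration only.

```python
from typing import Callable, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level edit distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def composite_reward(
    source: str,
    hypothesis: str,
    refinement: str,
    adequacy_fn: Callable[[str, str, str], float],
    alpha: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Blend negative normalized edit distance with an adequacy score.

    `adequacy_fn(source, hypothesis, refinement)` stands in for a
    COMET-style metric and is assumed to return a higher-is-better float.
    """
    hyp_tokens = hypothesis.split()
    ref_tokens = refinement.split()
    # Normalize by the longer sequence so the edit term lies in [-1, 0].
    denom = max(len(hyp_tokens), len(ref_tokens), 1)
    edit_term = -levenshtein(hyp_tokens, ref_tokens) / denom
    adequacy_term = adequacy_fn(source, hypothesis, refinement)
    return alpha * edit_term + (1.0 - alpha) * adequacy_term
```

In this sketch, a hypothesis that the teacher leaves unchanged receives an edit term of 0, so its reward depends only on the adequacy score; the more the teacher has to rewrite, the more negative the edit term becomes.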
