Table of Contents
Fetching ...

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

Abstract

Large language models frequently exhibit suboptimal performance on low resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration exploitation manifold. By integrating entropy tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Abstract

Large language models frequently exhibit suboptimal performance on low resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration exploitation manifold. By integrating entropy tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
Paper Structure (17 sections, 6 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 17 sections, 6 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of translation quality assessment across different models.
  • Figure 2: Policy entropy dynamics across RL algorithms and KL regimes.
  • Figure 3: Response length stability across six RL algorithms and three KL regimes (18 independent runs).
  • Figure 4: Sensitivity analysis of explicit length penalties. External constraints induce reward instability and entropy oscillations, leading to training divergence.
  • Figure 5: Results of pairwise ranking comparisons from professional human evaluations of translation samples.
  • ...and 3 more figures