VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu; Yimin Du; Qi An; Xin He; Cunqi Zhai; Fei Tan; Weijia Lin; Xiaochun Gong; Yongchao Deng; Shousheng Jia; Xiangzheng Zhang

Abstract

Large language models frequently underperform on low-resource languages, primarily because of inefficient subword segmentation and systemic imbalances in training data. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. The framework enforces three properties during training: prescribed sequence length, format consistency, and linguistic well-formedness. Central to our approach is a variable entropy mechanism that lets the model dynamically calibrate the balance between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, narrowing the performance gap for underrepresented languages.
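
The abstract names two ingredients, entropy-tempered advantage estimation and asymmetric clipping, without giving their formulas on this page. The sketch below is only an illustration of the general idea, not the paper's implementation. The asymmetric clip follows the standard PPO-style surrogate with separate lower and upper bounds; the function names, the default values of `eps_low`, `eps_high`, and `alpha`, and in particular the form `advantage * (1 + alpha * entropy)` are all assumptions introduced here for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_tempered_advantage(advantage, entropy, alpha=0.1):
    """Hypothetical tempering rule (an assumption, not taken from the paper):
    scale the advantage up at high-entropy positions so that exploratory
    tokens receive a larger learning signal."""
    return advantage * (1.0 + alpha * entropy)


def asymmetric_clipped_surrogate(logp_new, logp_old, advantage,
                                 eps_low=0.2, eps_high=0.28):
    """PPO-style per-token surrogate with an asymmetric clip range
    [1 - eps_low, 1 + eps_high]. The default bounds are illustrative."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) objective, as in standard PPO clipping.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    h = token_entropy([0.5, 0.25, 0.25])
    a = entropy_tempered_advantage(1.0, h)
    print(asymmetric_clipped_surrogate(math.log(2.0), 0.0, a))
```

With `eps_high` larger than `eps_low`, the probability of a positively rewarded token may grow further in one update than the probability of a penalized token may shrink, which is one common way to keep exploration alive while bounding the policy step.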

Paper Structure (17 sections, 6 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methods
Supervised Fine-Tuning.
Variable Entropy Alignment.
Optimization Recipe.
Experiments
Constraint Verification Performance
Algorithmic Comparison and Mechanism Analysis
Length Control Strategy Analysis
Performance Translation Benchmarking and Generalization Analysis
Analysis
Why High Entropy Facilitates Paraphrastic Translation
Entropy-Regularized Optimization Geometry
Temperature Consistent Ratios for Unbiased Optimization
...and 2 more sections

Figures (8)

Figure 1: Comparison of translation quality assessment across different models.
Figure 2: Policy entropy dynamics across RL algorithms and KL regimes.
Figure 3: Response length stability across six RL algorithms and three KL regimes (18 independent runs).
Figure 4: Sensitivity analysis of explicit length penalties. External constraints induce reward instability and entropy oscillations, leading to training divergence.
Figure 5: Results of pairwise ranking comparisons from professional human evaluations of translation samples.
...and 3 more figures
