LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang

TL;DR

This paper proposes Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space and bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.

Paper Structure (29 sections, 1 theorem, 20 equations, 3 figures, 3 tables, 1 algorithm)

Key Result

Theorem 3.1

Let $\boldsymbol{z} \in \mathbb{R}^V$ be the pre-softmax logits such that $p_\theta = \text{Softmax}(\boldsymbol{z})$. For any time step $t$, the gradient of $\mathcal{L}_{CE}$ with respect to the logits $\boldsymbol{z}$ is exactly the residual error vector between the model velocity $v_\theta$ and

Figures (3)

  • Figure 1: Overview of the LFPO framework. The training pipeline consists of four distinct phases. Step 1 (Generate & Estimate Rewards): The reference policy $\pi_{\text{old}}$ generates trajectories, and representative timesteps are selected via Stratified Trajectory Sampling to reduce variance. Step 2 (Block-wise Rectified Optimization): Data is partitioned into memory-efficient blocks to enable parallel logit computation. Step 3 (Policy Model Update): The policy $\pi_\theta$ is optimized to minimize the deviation from reward-induced implicit policies ($\pi^+$ and $\pi^-$), effectively performing vector field rectification. Step 4 (Reference Model Update): The reference model is stably updated via Exponential Moving Average (EMA).
  • Figure 2: Geometric interpretation of Discrete Lifting (see the Discrete Lifting subsection). We visualize the probability simplex $\Delta^{2}$ for a toy vocabulary of $|V|=3$. Vertices (Token A, B, C) represent deterministic one-hot data states (e.g., the ground truth target $\boldsymbol{x}_1$ corresponds to Token B = $[0,1,0]$). Interior Points: (1) $\boldsymbol{m}$: The Mask Prior (center, $[0.33, 0.33, 0.33]$), serving as the geometric origin of the flow; (2) $\boldsymbol{x}_t$: The Current State, modeled as a linear interpolation between the masked state $\boldsymbol{m}$ and target $\boldsymbol{x}_1$; (3) $P_\theta$: The Model Prediction, a categorical distribution over the vocabulary output by the network. Vectors (Velocities): The Ideal Velocity $u_t$ (black arrow) points from the mask towards the true target $\boldsymbol{x}_1$. Crucially, the Model Velocity $v_\theta$ (green arrow) is defined as the displacement from the mask $\boldsymbol{m}$ to the prediction $P_\theta$ (Eq. 1). The red dashed arrow $-\nabla L_{CE}$ illustrates the optimization direction, rectifying the model velocity towards the ground truth.
  • Figure 3: Convergence Analysis on Code and Reasoning Tasks. The plots show accuracy progression against training time (GPU Hours). The red curve represents our proposed LFPO, while the blue curve represents the baseline AGRPO. The horizontal dashed line marks the final converged accuracy of the baseline. Notably, LFPO requires substantially less training time to match or surpass the baseline's best performance, highlighting its superior sample efficiency and convergence speed.
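
The toy simplex in the Figure 2 caption can be reproduced in a few lines. The following is a minimal sketch, not the authors' code: the interpolation convention $\boldsymbol{x}_t = (1-t)\,\boldsymbol{m} + t\,\boldsymbol{x}_1$, the choice $t = 0.5$, the unscaled ideal velocity $u_t = \boldsymbol{x}_1 - \boldsymbol{m}$, and the prediction values in `p_theta` are all assumptions made for illustration. Only the mask prior, the target, and the definition of $v_\theta$ as the displacement from $\boldsymbol{m}$ to $P_\theta$ come from the caption.

```python
# Toy vocabulary of size 3 (Token A, B, C), as in the Figure 2 caption.

def lerp(a, b, t):
    """Linear interpolation (1 - t) * a + t * b, element-wise."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]

def sub(a, b):
    """Element-wise difference a - b."""
    return [ai - bi for ai, bi in zip(a, b)]

m = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]  # mask prior, centre of the simplex
x1 = [0.0, 1.0, 0.0]                   # ground-truth target, Token B
p_theta = [0.2, 0.7, 0.1]              # hypothetical model prediction

t = 0.5                                # assumed time step
x_t = lerp(m, x1, t)                   # current state (assumed convention)

u_t = sub(x1, m)                       # ideal velocity (assumed unscaled)
v_theta = sub(p_theta, m)              # model velocity: mask -> prediction

# The mask prior cancels, so the residual depends only on P_theta and x_1.
residual = sub(v_theta, u_t)

print("x_t      =", x_t)
print("u_t      =", u_t)
print("v_theta  =", v_theta)
print("residual =", residual)
```

Both velocities sum to zero, so moving along them keeps a point on the simplex, and under these assumptions the residual reduces to $P_\theta - \boldsymbol{x}_1$.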

Theorems & Definitions (1)

  • Theorem 3.1: Gradient Alignment
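
Theorem 3.1 concerns the gradient of $\mathcal{L}_{CE}$ with respect to the logits $\boldsymbol{z}$. The statement above is cut off, so the sketch below checks only the standard softmax cross-entropy identity $\nabla_{\boldsymbol{z}} \mathcal{L}_{CE} = \text{Softmax}(\boldsymbol{z}) - \boldsymbol{x}_1$ for a one-hot target, using central finite differences. Reading that quantity as a velocity residual $v_\theta - u_t$ relies on the assumption $u_t = \boldsymbol{x}_1 - \boldsymbol{m}$, which is not confirmed by the text shown here. The logit values and helper names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Cross-entropy loss for a one-hot target at index `target`."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """Closed form: softmax(z) minus the one-hot target vector."""
    p = softmax(z)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(z))]

def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of dL/dz."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        diff = cross_entropy(z_plus, target) - cross_entropy(z_minus, target)
        grad.append(diff / (2.0 * eps))
    return grad

z = [0.5, 1.5, -0.3]   # hypothetical logits for a 3-token vocabulary
target = 1             # Token B, matching the toy example in Figure 2

g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(z, target)
max_err = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))

print("analytic gradient:", g_analytic)
print("numeric gradient: ", g_numeric)
print("max abs error:    ", max_err)
```

The gradient entries sum to zero, so a gradient step on the logits shifts probability mass between tokens without changing the total.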