MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Feng Zhao

TL;DR

MaskFocus tackles reinforcement learning for masked generative models by focusing policy optimization on the most informative sampling steps. It introduces Critical Step Selection based on information gain from step-to-final image embeddings and adds Dynamic Routing Sampling to balance exploration across samples with varying entropy. Empirical results on GenEval and T2I-CompBench show improved compositional control and image quality, surpassing prior MaskGRPO and approaching diffusion-model baselines. The approach reduces computational burden while enhancing fidelity and instruction-following in text-to-image synthesis.

Abstract

Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core difficulty is that policy optimization must account for the likelihood of every step in the multi-step, iterative refinement process. Relying on entire sampling trajectories incurs high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate image at each sampling step and the final generated image. We then use this measure to identify the most critical steps and perform focused policy optimization on them. Furthermore, we design an entropy-based dynamic routing sampling mechanism that encourages the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple text-to-image benchmarks validate the effectiveness of our method.

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Motivation of our method. (a) The tokens unmasked in the early steps determine the appearance and structure of the image and carry most of the useful information. (b) The left plot shows the cosine similarity $S_t$ between the image embedding $E_t$ at each step and the final embedding $E_T$. The right plot shows the absolute difference in $S_t$ between consecutive steps. The image does not change uniformly over the sampling process; in particular, certain early steps have a much larger impact on the generated image. (c) Different samples exhibit different entropy trajectories during generation. Lower entropy implies more deterministic sampling, which limits exploration and makes high-quality images less likely.
  • Figure 2: Overview of our method. 1) Dynamic Routing Sampling (DR-Sampling). During sampling, we apply a more exploratory sampling strategy to low-entropy samples and normal sampling to high-entropy samples. 2) Critical Step Selection (CSS). We then determine the critical steps in the sampling trajectories, and obtain the corresponding masks, based on the cosine similarity between the intermediate embeddings and the final generated embedding. 3) We randomly shuffle the masks, re-mask the generated tokens, and predict the probabilities of these masked tokens to optimize the training objective (see left). The full procedure is detailed in Alg. \ref{alg:method}.
  • Figure 3: Qualitative Comparison. Our approach demonstrates superior performance in image quality and human preference (top two rows), as well as in instruction-following tasks involving Counting, Colors, Attribute Binding, and Position (bottom two rows).
  • Figure 4: More comparison results on step selection strategy, sampling strategy, CFG, and mask strategy.
  • Figure 5: Pre-/Post-RL sampling trajectories.
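
The two mechanisms described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that critical steps are the k steps with the largest absolute change in $S_t$ between consecutive steps, and that routing compares each sample's mean token entropy against a fixed threshold. The function names, the top-k rule, the `threshold` parameter, and the toy embeddings are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_critical_steps(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of k steps with the largest |S_t - S_{t-1}|.

    embeddings[t] plays the role of E_t; the last entry is the final E_T.
    Assumption: "critical" means top-k by absolute similarity change.
    """
    final = embeddings[-1]
    sims = [cosine_similarity(e, final) for e in embeddings]  # S_t
    # gains[i] is the change introduced by step i + 1.
    gains = [abs(sims[t] - sims[t - 1]) for t in range(1, len(sims))]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return sorted(i + 1 for i in ranked[:k])


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_sampling(mean_entropies: Sequence[float], threshold: float) -> List[str]:
    """Route low-entropy samples to exploratory sampling, others to normal.

    Assumption: a fixed scalar threshold decides the route.
    """
    return ["exploratory" if h < threshold else "normal" for h in mean_entropies]


# Toy trajectory: four 2-D "embeddings", the last one being the final image.
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
critical = select_critical_steps(toy_embeddings, k=2)
routes = route_sampling([0.1, 2.0], threshold=1.0)
```

In this toy trajectory the similarity to the final embedding rises fastest in the first two steps, so those are the ones selected, which mirrors the caption's observation that early steps matter most. In practice the embeddings would come from an image encoder applied to decoded intermediate images, and the entropies from the model's per-token predictive distributions.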