Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibing Huang, Chi Zhang, Xuelong Li

TL;DR

The paper tackles the limitation of scalar rewards in GRPO-based visual generation by introducing ViPO, which uses a Perceptual Structuring Module to generate pixel-level, region-aware advantages. By extracting perceptual cues from pretrained vision backbones and constructing allocation maps, ViPO redistributes learning pressure toward perceptually salient regions, improving both image and video generation fidelity and alignment with human preferences. Empirical results across Flux (image) and Wan2.1 (video) show stronger in-domain performance and better out-of-domain generalization, with ablations validating the effectiveness of the allocation map, variance-weighted aggregation, and a three-component PCA setting. The framework remains lightweight and architecture-agnostic, compatible with existing GRPO pipelines, and points to future directions in structured, region-aware policy learning for high-dimensional visual tasks.

Abstract

Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
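
For readers unfamiliar with the baseline, the "single scalar reward per sample" in standard GRPO is usually turned into one advantage per sample by normalizing rewards within the group generated for the same prompt. The following is a minimal sketch of that step only. The function name, the `eps` constant, and the example reward values are illustrative assumptions, not taken from the paper, and some implementations use the sample rather than the population standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's scalar rewards to zero mean and unit spread.

    Each returned value is a single scalar that applies to the whole
    image or video, which is the coarse signal the abstract refers to.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four samples of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Every pixel and frame of a sample shares the same entry of `advantages`, so a localized artifact and a well-rendered region receive identical credit.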

Paper Structure

This paper contains 15 sections, 14 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Brief illustration of our work. Existing GRPO for visual generation assigns a single scalar advantage to the entire content, producing coarse feedback that often leads to sub-optimal results. In contrast, our ViPO converts this coarse signal into preference-aware feedback, enabling fine-grained alignment. This allows, for instance, differentiated optimization of the dancing doll and its background, yielding outputs that are more coherent, harmonious, and perceptually pleasing.
  • Figure 2: Overview of the proposed Visual Preference Policy Optimization (ViPO) framework. Policy-sampled outputs are first evaluated by the reward model to obtain scalar advantages. In parallel, the samples are processed by the Perceptual Structuring Module (PSM) to produce allocation maps. The allocation maps are then combined with the scalar advantages to yield pixel-level, preference-aware advantages, which guide fine-grained visual preference policy optimization.
  • Figure 3: Qualitative comparison on Flux. Each group of results is arranged from left to right as follows: outputs from Flux, DanceGRPO, and our proposed ViPO. Our method demonstrates the best visual performance, exhibiting richer details, more realistic rendering, and overall superior perceptual quality.
  • Figure 4: Qualitative comparison on Wan2.1. Each demo group is arranged top to bottom as follows: the result from Wan2.1, the output after applying DanceGRPO, and the output after applying ViPO. Our method delivers superior visual quality and motion dynamics. In addition, we highlight representative regions with red boxes to indicate improvements over Wan2.1 and green boxes to indicate improvements over DanceGRPO.
  • Figure 5: Comparison under the redness reward across training steps. As training progresses, results from DanceGRPO tend to suffer from semantic degradation and structural collapse. In contrast, ViPO consistently maintains the original semantic intent and structural integrity.
  • ...and 5 more figures
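
The pipeline in the Figure 2 caption (allocation maps combined with scalar advantages to give pixel-level advantages) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the function names, the use of absolute component values as saliency, the mean-one normalization, and the toy 2x2 maps. The summary above only states that the method uses an allocation map, variance-weighted aggregation, and a three-component PCA setting; it does not give the exact formulas, so this should not be read as the authors' implementation.

```python
def allocation_map(component_maps, variances, eps=1e-8):
    """Aggregate K per-pixel component maps into one map with mean 1.

    component_maps: list of K maps, each an H x W nested list (for example,
        projections of backbone features onto K principal components).
    variances: K non-negative weights (for example, explained variances).
    """
    total_var = sum(variances)
    weights = [v / total_var for v in variances]
    h, w = len(component_maps[0]), len(component_maps[0][0])
    # Variance-weighted sum of absolute responses (assumed saliency proxy).
    agg = [
        [sum(wk * abs(m[i][j]) for wk, m in zip(weights, component_maps))
         for j in range(w)]
        for i in range(h)
    ]
    # Normalize so the map averages to 1 and only redistributes credit.
    avg = sum(sum(row) for row in agg) / (h * w)
    return [[x / (avg + eps) for x in row] for row in agg]


def pixel_advantages(scalar_advantage, alloc):
    """Lift one scalar advantage to a per-pixel advantage map."""
    return [[scalar_advantage * a for a in row] for row in alloc]


# Three hypothetical 2x2 component maps and their variances.
maps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [4.0, 0.0]],
]
alloc = allocation_map(maps, variances=[2.0, 1.0, 1.0])
pixel_adv = pixel_advantages(-1.5, alloc)
```

Because the map in this sketch averages to 1, the mean of `pixel_adv` stays equal to the original scalar advantage, while pixels with stronger responses receive a larger share of it. That is one simple way to redistribute optimization pressure without changing its overall magnitude.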