PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian

TL;DR

The paper tackles the challenge of generating multiple images with consistent identities, styles, and narratives. It introduces PaCo-RL, a framework that pairs PaCo-Reward, a pairwise consistency reward model, with PaCo-GRPO, an efficient RL algorithm that uses resolution-decoupled training and log-tamed reward aggregation to improve efficiency and stability. PaCo-Reward is trained on the large-scale PaCo-Dataset and frames consistency evaluation as a generative Yes/No task with chain-of-thought (CoT) reasoning to align with human judgments. Across Text-to-ImageSet and Image Editing tasks, PaCo-RL achieves state-of-the-art consistency and substantial efficiency gains, demonstrating practical potential for scalable, human-aligned consistent image generation.

Abstract

Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.
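The abstract does not give the formula for log-tamed multi-reward aggregation, so the following is only one plausible reading: rewards marked as unstable are compressed with `log(1 + r)` before a weighted sum, which keeps a single large reward from dominating the total. The function name, the weights, and the `tame` set are hypothetical and should not be read as the paper's definition.

```python
import math


def aggregate_rewards(
    rewards: dict[str, float],
    weights: dict[str, float],
    tame: set[str],
) -> float:
    """Weighted sum of rewards, log-compressing those named in `tame`.

    Assumes every tamed reward is non-negative so that log1p is well defined.
    """
    total = 0.0
    for name, r in rewards.items():
        # log1p(r) = log(1 + r) grows slowly for large r, damping outliers.
        value = math.log1p(r) if name in tame else r
        total += weights[name] * value
    return total
```

For example, with rewards `{"consistency": 9.0, "prompt": 0.8}` and equal weights, taming `"consistency"` shrinks its contribution from 9.0 to about 2.3, so the two rewards sit on a more comparable scale.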

Paper Structure

This paper contains 22 sections, 6 equations, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: Overview of the proposed PaCo-Reward framework.
  • Figure 2: Overview of our proposed PaCo-GRPO framework on the Text-to-ImageSet generation task.
  • Figure 3: Training processes under different image resolutions.
  • Figure 4: Evaluation processes under different training image resolutions.
  • Figure 5: Ablation of log-tamed aggregation on the reward ratio.
  • ...and 8 more figures