OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu

TL;DR

OneReward addresses the challenge of unifying multi-task mask-guided image editing under diverse evaluation criteria by using a single Vision-Language Reward Model to guide reinforcement learning. The authors introduce Seedream 3.0 Fill, a multi-task RLHF-based model trained directly on a pre-trained base without task-specific SFT, achieving state-of-the-art results across image fill, extend, removal, and text rendering. They also present a dynamic reinforcement learning variant and open-source FLUX Fill [OneReward], demonstrating robust generalization and practical benefits for unified image editing. The work highlights the potential of unified reward modeling to streamline training and improve cross-task performance in diffusion/flow matching settings.

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. We employ a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, so the framework can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, each involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

Paper Structure

This paper contains 19 sections, 6 equations, 14 figures, 2 tables, 2 algorithms.

Figures (14)

  • Figure 2: Visual showcase of Seedream 3.0 Fill results across four scenarios: image fill, image extend, object removal, and text rendering. Each column presents a representative example with corresponding prompts and outputs, demonstrating the model’s unified capability across diverse generation objectives.
  • Figure 3: Overall pipeline of our unified RL procedure. We first randomly sample images and conditions from different tasks with a certain probability. Starting from the same condition and different initial noise, the reference image is fully denoised using the reference model, denoted as $\pi_{ref}(\cdot)$, while the evaluation image is partially denoised up to a randomly selected step, from which the policy model, denoted as $\pi_{\theta}(\cdot)$, directly predicts $x_0^{\prime}$. The reward model guides learning by encouraging the policy model to outperform the reference model across all evaluation dimensions and tasks.
  • Figure 4: Illustration of the pairwise annotation process. Given multiple candidate outputs for the same prompt and binary mask, annotators identify the best and worst samples under each evaluation dimension to form a winner/loser pair. If the differences between candidates are negligible, the dimension is discarded (denoted by $\emptyset$), ensuring that only informative comparisons are retained. To clarify, this showcase uses an all-one mask, meaning the entire image region is generated.
  • Figure 5: Details of our one reward model. We use a VLM to judge whether the first image is better than the second one. During reward feedback learning, the probability of the $y^+$ token is treated as the reward for the diffusion model. We simply add the edit task and the evaluation dimension to the user query, which enables training across different tasks and dimensions. The content in angle brackets is optional and is added only when the evaluation dimension is Text Alignment.
  • Figure 6: We visualize the reward curves of Consistency, Structure, Text Alignment, and Aesthetics for image fill (blue) and image extend (green), and of Removal Quality for object removal (orange).
  • ...and 9 more figures
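
The captions of Figures 4 and 5 describe two mechanics that a small example may make more concrete: forming winner/loser pairs while discarding uninformative dimensions, and reading the reward off the probability of the $y^+$ token. The sketch below is illustrative only and is not the authors' implementation. The function names (`form_pairs`, `build_query`, `reward_from_logits`), the query wording, and the choice to normalize over just two answer tokens (a positive and a negative one) are all assumptions made for this example.

```python
import math
from typing import Dict, List, Optional, Tuple


def form_pairs(
    annotations: Dict[str, Optional[Tuple[int, int]]]
) -> List[Tuple[str, int, int]]:
    """Turn per-dimension (best_idx, worst_idx) labels into winner/loser pairs.

    A value of None stands for a discarded dimension (the empty-set case in
    Figure 4), so it produces no pair. Hypothetical data layout.
    """
    return [(dim, bw[0], bw[1]) for dim, bw in annotations.items() if bw is not None]


def build_query(task: str, dimension: str, prompt: Optional[str] = None) -> str:
    """Compose a reward-model query from the edit task and evaluation dimension.

    The wording is hypothetical. The prompt is appended only for the
    Text Alignment dimension, mirroring the optional angle-bracket content.
    """
    query = f"Task: {task}. Dimension: {dimension}."
    if dimension == "Text Alignment" and prompt is not None:
        query += f" Prompt: <{prompt}>."
    return query + " Is the first image better than the second one?"


def reward_from_logits(logit_pos: float, logit_neg: float) -> float:
    """Probability of the positive answer token, used as the reward.

    Assumption: a two-way softmax over the positive and negative answer
    tokens. The max is subtracted first for numerical stability.
    """
    m = max(logit_pos, logit_neg)
    e_pos = math.exp(logit_pos - m)
    e_neg = math.exp(logit_neg - m)
    return e_pos / (e_pos + e_neg)


if __name__ == "__main__":
    labels = {"Aesthetics": (2, 0), "Structure": None}
    print(form_pairs(labels))
    print(build_query("image fill", "Text Alignment", "a red sign"))
    print(reward_from_logits(0.0, 0.0))
```

With equal logits the reward is 0.5, which corresponds to the reward model having no preference between the two images. A larger positive-token logit pushes the reward toward 1.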