OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song

TL;DR

OmniRefiner tackles the problem of losing fine-grained textures in reference-guided image refinement by introducing a two-stage framework that first performs supervised finetuning to enable dual-input detail restoration with bidirectional attention, then applies GRPO-based reinforcement learning to sharpen local details while preserving global structure. It relies on a patch-wise reward design combining DreamSim-based perceptual similarity and masked pixel accuracy, optimized over a scalable, automated quadruplet data pipeline that generates training targets from degraded inputs and references. A 30K-triplet benchmark supports scalable supervision and reliable evaluation. Empirical results show state-of-the-art fidelity and detail preservation across diverse content and diffusion backbones, outperforming both open-source and commercial models on challenging reference-guided restoration tasks.
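
To make the patch-wise reward described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a patch is a small 2-D grid of grayscale values in [0, 1], that the perceptual term is supplied as a callable returning a distance in [0, 1] (a stand-in for a DreamSim-style metric), and that the two terms are mixed with equal weights. The names `masked_mse`, `patch_reward`, `w_perc`, and `w_pix` are hypothetical.

```python
from typing import Callable, Sequence

Patch = Sequence[Sequence[float]]  # 2-D grayscale patch, values assumed in [0, 1]


def masked_mse(pred: Patch, target: Patch, mask: Patch) -> float:
    """Mean squared error over pixels weighted by the mask (0.0 if the mask is empty)."""
    total, weight = 0.0, 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2
            weight += m
    return total / weight if weight > 0 else 0.0


def patch_reward(
    pred: Patch,
    target: Patch,
    mask: Patch,
    perceptual_distance: Callable[[Patch, Patch], float],
    w_perc: float = 0.5,  # assumed weight, not taken from the paper
    w_pix: float = 0.5,   # assumed weight, not taken from the paper
) -> float:
    """Hypothetical combined reward: higher means closer to the target patch."""
    perc_term = 1.0 - perceptual_distance(pred, target)  # similarity from distance
    pix_term = 1.0 - masked_mse(pred, target, mask)      # masked pixel accuracy
    return w_perc * perc_term + w_pix * pix_term


def toy_distance(a: Patch, b: Patch) -> float:
    """Placeholder for a learned perceptual metric: mean absolute difference."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

In practice the placeholder `toy_distance` would be replaced by a learned perceptual metric, and the mask would mark the region that the refinement is expected to repair.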

Abstract

Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, existing post-editing approaches that amplify local details often produce results inconsistent with the original image in lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

Paper Structure

This paper contains 23 sections, 11 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Compared with the state-of-the-art multi-image editing methods, our approach achieves not only faithful reconstruction of the original image in reference–repair tasks, but also excellent performance in various reconstruction scenarios including text, patterns, facial details, and object details. In contrast, existing methods often fail to remain faithful to the original image during repair or are unable to recover text and fine details.
  • Figure 2: Overall architecture of OmniRefiner. Our framework adopts a two-stage training pipeline. In the first stage, we perform supervised fine-tuning (SFT) to enable dual-input detail restoration while preserving global structure. In the second stage, we apply GRPO-based reinforcement learning to further enhance fine-grained consistency and local repair quality. This joint design enables precise reference-guided refinement with high visual fidelity.
  • Figure 3: We adopt a four-stage data pipeline. First, a VLM pairs images of the same product with consistent styles and reasonable viewpoints. Second, it generates fine-grained editing instructions for one image in each pair. Third, an image editing model executes these edits using the pre-edit image as ground truth, forming our (input, reference, ground truth) triplet dataset. Finally, the VLM generates an instruction guiding the model to restore the input using the reference, based on the input, reference, and ground truth.
  • Figure 4: The DreamSim and masked-MSE reward curves show how our model aligns with the reward functions over the course of GRPO training.
  • Figure 5: Qualitative results demonstrate that our method can accurately restore fine details in images.
  • ...and 7 more figures
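
The GRPO stage mentioned in Figures 2 and 4 can be illustrated with the group-relative advantage that GRPO is generally built on: several candidates are sampled for the same input, each is scored by the reward, and each reward is normalized against the group's mean and standard deviation. This is a minimal sketch of that general idea; the paper's exact normalization and group size are not given in this excerpt, and the function name and `eps` value are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    # eps (assumed value) avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Example: three refinements sampled for the same (input, reference) pair
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Candidates that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every candidate scores the same, all advantages are zero and the group contributes no learning signal.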