OmniRefiner: Reinforcement-Guided Local Diffusion Refinement
Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song
TL;DR
OmniRefiner tackles the problem of losing fine-grained textures in reference-guided image refinement by introducing a two-stage framework that first performs supervised finetuning to enable dual-input detail restoration with bidirectional attention, then applies GRPO-based reinforcement learning to sharpen local details while preserving global structure. It relies on a patch-wise reward design combining DreamSim-based perceptual similarity and masked pixel accuracy, optimized over a scalable, automated quadruplet data pipeline that generates training targets from degraded inputs and references. A 30K-triplet benchmark supports scalable supervision and reliable evaluation. Empirical results show state-of-the-art fidelity and detail preservation across diverse content and diffusion backbones, outperforming both open-source and commercial models on challenging reference-guided restoration tasks.
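The reward design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the pixel tolerance, the equal weighting, and every function name below are assumptions made for illustration. The perceptual term is a simple stand-in (one minus mean absolute difference) where a DreamSim-style similarity model would be plugged in. The group-relative normalization at the end follows the form commonly used with GRPO; the summary does not spell out the exact variant.

```python
import numpy as np

def perceptual_similarity_stub(pred, target):
    # Stand-in for a DreamSim-style similarity in [0, 1] (assumption):
    # a real setup would call a learned perceptual model here.
    return float(1.0 - np.abs(pred - target).mean())

def masked_pixel_accuracy(pred, target, mask, tol=0.05):
    # Fraction of masked pixels whose largest per-channel error is within tol.
    err = np.abs(pred - target).max(axis=-1)
    m = mask.astype(bool)
    if not m.any():
        return 1.0
    return float((err[m] <= tol).mean())

def patch_reward(pred, target, mask, patch=32, w_perc=0.5, w_pix=0.5,
                 sim_fn=perceptual_similarity_stub):
    # Average a weighted perceptual + pixel score over patches that
    # overlap the edit mask; patches outside the mask are skipped.
    height, width = mask.shape
    scores = []
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            m = mask[y:y + patch, x:x + patch]
            if not m.any():
                continue
            p = pred[y:y + patch, x:x + patch]
            t = target[y:y + patch, x:x + patch]
            scores.append(w_perc * sim_fn(p, t)
                          + w_pix * masked_pixel_accuracy(p, t, m))
    return float(np.mean(scores)) if scores else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization: each sample's reward is standardized
    # against the other samples drawn for the same input.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

In this sketch, images are float arrays in [0, 1] with shape (H, W, 3) and the mask is a boolean (H, W) array marking the region to refine. A group of candidate refinements for one input would each be scored with `patch_reward`, and the resulting list passed to `group_relative_advantages` to weight the policy update.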
Abstract
Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details with existing methods often produce results inconsistent with the original image in lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
