Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian Price, Scott Cohen, Jianming Zhang, Daniel Aliaga

TL;DR

This work tackles localized artifacts in personalized image generation by introducing Refine-by-Align, a diffusion-based, two-stage pipeline that first aligns artifact regions to a high-quality reference via cross-attention and then refines the artifacts using a DINOv2-guided diffusion model. The alignment stage computes an optimal correspondence map $\mathbf{M}^*$ by aggregating cross-attention across timesteps and layers, while the refinement stage preserves identity by conditioning on the matched reference features. The model is trained in two modes (alignment and refinement) with self-supervised and paired data, and requires no test-time tuning. Experiments on GenArtifactBench across customization, compositing, view synthesis, and virtual try-on show substantial gains in fidelity and fine-grained detail over six baselines, making artifact refinement more controllable and reliable across diverse generative pipelines.
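The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the summary does not say how the attention maps are combined or binarized, so the uniform sum over all (timestep, layer) maps, the min-max normalization, the `threshold` value, and the function name `correspondence_map` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# One attention map: rows are query positions in the generated image,
# columns are reference tokens (a flattened height x width grid).
Matrix = Sequence[Sequence[float]]


def correspondence_map(
    attn_maps: Sequence[Matrix],
    artifact_mask: Sequence[bool],
    ref_hw: Tuple[int, int],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[List[int]]:
    """Binary map over the reference grid marking tokens that the masked
    (artifact) queries attend to most, pooled over all supplied maps."""
    height, width = ref_hw
    n_ref = height * width
    masked_rows = [i for i, inside in enumerate(artifact_mask) if inside]
    if not masked_rows:
        raise ValueError("artifact mask selects no query positions")

    # Sum attention from artifact queries to each reference token over every
    # (timestep, layer) map. After min-max scaling a sum and a mean agree.
    scores = [0.0] * n_ref
    for attn in attn_maps:
        for i in masked_rows:
            for k in range(n_ref):
                scores[k] += attn[i][k]

    lo, hi = min(scores), max(scores)
    span = hi - lo
    normalised = [(s - lo) / span if span > 0 else 0.0 for s in scores]
    flat = [1 if v >= threshold else 0 for v in normalised]
    # Reshape the flat token list back into the reference grid.
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Toy input: 3 query positions, a 2x2 reference grid, two attention maps.
map_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7], [0.6, 0.2, 0.1, 0.1]]
map_b = [[0.5, 0.3, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.8, 0.1, 0.05, 0.05]]
toy_map = correspondence_map([map_a, map_b], [True, False, True], (2, 2))
print(toy_map)
```

In the toy input only queries 0 and 2 lie inside the artifact mask, and both attend mostly to the first reference token, so that token is the only one selected. In practice the maps would come from the cross-attention layers of the diffusion model rather than from hand-written lists.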

Abstract

Personalized image generation has emerged from recent advances in generative models. However, the generated personalized images often suffer from localized artifacts such as incorrect logos, which reduce the fidelity and fine-grained identity details of the results. Furthermore, there is little prior work tackling this problem. To help improve these identity details in personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages, an Alignment Stage and a Refinement Stage, which share the weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which are then used by the refinement stage to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, generalizing well to existing models on various tasks including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in image synthesis models.

Paper Structure

This paper contains 26 sections, 5 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Refine-by-Align. Given a generated image (with artifacts), a free-form mask indicating the artifacts region in the generated image, and a high-quality reference image containing important details such as an identity logo or font, our model can automatically refine the artifacts in the generated image by leveraging the corresponding details from the reference. The proposed method could benefit various applications (e.g., DreamBooth ruiz2023dreambooth for text-to-image customization, IDM-VTON choi2024improving for virtual try-on, AnyDoor chen2023anydoor for object composition, and Zero 1-to-3++ shi2023zero123++ for novel view synthesis).
  • Figure 2: Comparison of our region-matching method with keypoint matching. We use DIFT tang2023emergent and DHF luo2023dhf to perform keypoint matching from the artifacts region (10 points are sampled along the artifacts contour) to the reference. DIFT and DHF often fail to find the correct corresponding region; in addition, they have trouble distinguishing between repeating patterns, as in (a) and (c). In contrast, our method is more robust. The results demonstrate that artifacts alignment is a non-trivial process.
  • Figure 3: Overview of our framework. Top: During training, we train a diffusion model (DM) for object completion, guided by a reference image $I_r$. In alignment mode, the reference is a complete object, so the model learns to locate the relevant region in the reference for object completion, thus maximizing the spatial correlation in the attention maps. In refinement mode, this region is provided directly as the reference. Bottom: During inference, the inputs are a generated image $I_a$ with the artifacts marked by the mask $M_a$, and a reference object $I_r$. In the alignment stage, we run the cross-attention alignment algorithm (see Alg. \ref{alg:algorithm1} and Fig. \ref{fig:layer_timestep}) to find the correspondence map $\mathbf{M}^*$. In the refinement stage, $\mathbf{M}^*$ is used to find the region in $I_r$ that corresponds to the artifacts, which guides the refinement of $I_a$.
  • Figure 4: Running the cross-attention alignment algorithm on GenArtifactBench to find the best combination of timestep $t$ and transformer layer $l$. Left: mIoU across all timesteps, averaged over all layers and images; Right: mIoU across all layers, averaged over all timesteps and images.
  • Figure 5: Top: Visualization of our cross-attention alignment algorithm. The artifacts mask is used to extract the spatial correlations between the artifacts and the reference; the output of this algorithm, the correspondence map, indicates the region in the reference that corresponds to the artifacts area. Middle and Bottom: Correspondence maps across different transformer layers and timesteps.
  • ...and 8 more figures
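
The layer and timestep search summarized in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: it assumes each image has a ground-truth reference region to score against, and the function names `iou` and `miou_by_timestep_and_layer` and the nested-list layout of the predictions are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask over the reference grid


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal length."""
    inter = sum(1 for p, g in zip(pred, target) if p and g)
    union = sum(1 for p, g in zip(pred, target) if p or g)
    return inter / union if union else 1.0  # two empty masks count as a match


def miou_by_timestep_and_layer(
    preds: Sequence[Sequence[Sequence[Mask]]],  # preds[image][timestep][layer]
    targets: Sequence[Mask],                    # targets[image], assumed ground truth
) -> Tuple[List[float], List[float], Tuple[int, int]]:
    n_img, n_t, n_l = len(preds), len(preds[0]), len(preds[0][0])
    # Mean IoU over images for every (timestep, layer) pair.
    grid = [
        [sum(iou(preds[n][t][l], targets[n]) for n in range(n_img)) / n_img
         for l in range(n_l)]
        for t in range(n_t)
    ]
    # Left panel: average over layers. Right panel: average over timesteps.
    per_timestep = [sum(row) / n_l for row in grid]
    per_layer = [sum(grid[t][l] for t in range(n_t)) / n_t for l in range(n_l)]
    best = max(
        ((t, l) for t in range(n_t) for l in range(n_l)),
        key=lambda tl: grid[tl[0]][tl[1]],
    )
    return per_timestep, per_layer, best


# Toy input: one image, two timesteps, two layers, four reference tokens.
toy_target = [1, 1, 0, 0]
toy_preds = [[[[1, 1, 0, 0], [1, 0, 0, 0]],
              [[0, 0, 1, 1], [1, 1, 1, 1]]]]
per_t, per_l, best_tl = miou_by_timestep_and_layer(toy_preds, [toy_target])
print(per_t, per_l, best_tl)
```

The two marginal lists correspond to the left and right panels of the figure, while `best_tl` is the single (timestep, layer) pair with the highest mean IoU on the full grid. The marginals can hide that pair, which is why the sketch searches the grid directly.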