The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou

TL;DR

The paper tackles fine-grained detail inconsistencies in reference-guided image generation by introducing ImageCritic, a post-editing framework equipped with a reference–degraded–target dataset, an attention alignment loss, and a detail encoder. It leverages a DiT-based editing backbone (Flux Kontext) and an automated agent chain to localize and correct discrepancies across multiple rounds, preserving global structure while fixing text and logo regions. Empirical results on DreamBench++ and CriticBench, plus extensive qualitative comparisons, show substantial improvements in detail fidelity and cross-model robustness. The work advances practical high-fidelity, reference-consistent image editing with an adaptable, agent-driven workflow that can operate across languages and styles.

Abstract

Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images with a reference-guided post-editing approach, which we call ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
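
The multi-round, local-editing agent workflow mentioned in the abstract can be pictured as a detect-then-correct loop. The sketch below is only an illustration under stated assumptions: `Region`, `multi_round_correct`, `toy_detect`, `toy_correct`, and `max_rounds` are hypothetical names introduced here, not the paper's API. In the real system a VLM-based detector and the ImageCritic editing model would presumably take the place of the two toy callables, which here operate on small integer grids.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # toy stand-in for an image: a 2D grid of pixel values


@dataclass(frozen=True)
class Region:
    """Hypothetical axis-aligned box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def multi_round_correct(
    reference: Image,
    generated: Image,
    detect: Callable[[Image, Image], List[Region]],
    correct: Callable[[Image, Image, Region], Image],
    max_rounds: int = 3,
) -> Image:
    """Repeat detection and local correction until nothing is flagged."""
    current = [row[:] for row in generated]  # never mutate the caller's image
    for _ in range(max_rounds):
        regions = detect(reference, current)
        if not regions:
            break  # no remaining inconsistencies were detected
        for region in regions:
            current = correct(reference, current, region)
    return current


def toy_detect(reference: Image, current: Image) -> List[Region]:
    """Toy detector (assumption): bounding box of all mismatched pixels."""
    diffs = [
        (r, c)
        for r, row in enumerate(reference)
        for c, value in enumerate(row)
        if current[r][c] != value
    ]
    if not diffs:
        return []
    rows = [r for r, _ in diffs]
    cols = [c for _, c in diffs]
    return [Region(min(cols), min(rows), max(cols) + 1, max(rows) + 1)]


def toy_correct(reference: Image, current: Image, region: Region) -> Image:
    """Toy corrector (assumption): copy reference pixels inside the region."""
    fixed = [row[:] for row in current]
    for r in range(region.top, region.bottom):
        for c in range(region.left, region.right):
            fixed[r][c] = reference[r][c]
    return fixed
```

Passing the detector and corrector as callables keeps the loop independent of any particular model, which matches the abstract's description of ImageCritic as a component that can be integrated into an agent framework.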

Paper Structure

This paper contains 19 sections, 9 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Visual illustrations. (a) We first perform customized generation with GPT-4o [gpt] and then apply different methods for post-editing. The editing method [qwenimage] struggles to achieve fine-grained consistency, while images processed by super-resolution methods [guo2024refir] often exhibit noticeable detail inaccuracies. In contrast, our proposed ImageCritic corrects local details to ensure text and logo consistency while maintaining accurate spatial alignment, significantly improving the overall coherence of generated images. (b) We further apply our method to customized results generated by state-of-the-art closed-source [nanobanana] and open-source [qwenimageuno] models. After our correction, the fine-grained details of the generated images align precisely with those of the original objects, demonstrating the superior performance of our approach.
  • Figure 2: Data curation pipeline. (a) illustrates the complete pipeline of our approach, which involves generating customized images using existing state-of-the-art models, applying VLM-based filtering, and performing degradation. (b) shows local regions from our dataset, where the target patch aligns well with the input patch, and the degraded patch effectively simulates fine-grained inconsistencies in text and logo areas commonly seen between the input patch and the generated patch.
  • Figure 3: Overview of the proposed ImageCritic. ImageCritic employs a Detail Encoder and an Attention Alignment Loss to enable the model to localize regions requiring restoration, thereby achieving high-quality and consistent image correction. Furthermore, we develop a fully automated agent framework that supports both local patch restoration and multi-round correction.
  • Figure 4: Attention visualization. We separately extract the noise attention maps with respect to the reference image and the input image to be corrected, denoted as $M_R$ and $M_I$, respectively. The first row shows the results of the LoRA-finetuned base model, while the second row presents the results after applying the attention alignment loss. The first two columns correspond to the attention map of the double stream layer, and the last two columns correspond to the single stream layer. It can be observed that the attention alignment loss effectively promotes attention disentanglement.
  • Figure 5: Effect of the detail encoder. We find that when the input image exhibits structural differences from the reference image, the model fails to correctly identify the intended reference object, leading to inconsistent generation results.
  • ...and 9 more figures
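
As a rough illustration of the attention maps $M_R$ and $M_I$ described in the Figure 4 caption, the sketch below splits a toy attention matrix into the part attending to reference tokens and the part attending to input-image tokens. It rests on assumptions not stated in the source: the rows are noise query tokens, the key columns are ordered `[reference | input]` with no text tokens, and each map is averaged over queries. The function name `split_noise_attention` is hypothetical, and this is not the paper's attention alignment loss, whose definition is not given here.

```python
from typing import List, Tuple


def split_noise_attention(
    attn: List[List[float]], n_ref: int, n_in: int
) -> Tuple[List[float], List[float]]:
    """Average noise-query attention over reference keys and input keys.

    attn: one row per noise query token; columns are key tokens, assumed
          to be ordered as [reference tokens | input-image tokens].
    Returns (m_r, m_i): per-key attention averaged over all queries.
    """
    if not attn:
        raise ValueError("attn must contain at least one query row")
    if any(len(row) != n_ref + n_in for row in attn):
        raise ValueError("each row must have n_ref + n_in columns")
    n_queries = len(attn)
    m_r = [sum(row[j] for row in attn) / n_queries for j in range(n_ref)]
    m_i = [sum(row[n_ref + j] for row in attn) / n_queries for j in range(n_in)]
    return m_r, m_i
```

With maps separated this way, one could inspect how much attention mass falls on the reference versus the input image, which is the kind of comparison the first and second rows of Figure 4 visualize.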