G-Refine: A General Quality Refiner for Text-to-Image Generation

Chunyi Li; Haoning Wu; Hongkun Hao; Zicheng Zhang; Tengchaun Kou; Chaofeng Chen; Lei Bai; Xiaohong Liu; Weisi Lin; Guangtao Zhai

TL;DR

G-Refine tackles inconsistent quality in Text-to-Image (T2I) generation by introducing a perceptual quality indicator and an alignment quality indicator (PQ-Map and AQ-Map) together with a two-stage quality refiner. By mapping quality defects and semantically aligning prompts to image regions, it applies targeted denoising to improve low-quality areas while preserving high-quality content. Across four AIGI datasets and multiple models, G-Refine improves a wide range of perceptual and alignment metrics with minimal negative effects. The approach enables online, prompt-based refinement without modifying the underlying generative models, which eases the practical deployment of T2I systems.

Abstract

With the evolution of Text-to-Image (T2I) models, the quality defects of AI-Generated Images (AIGIs) pose a significant barrier to their widespread adoption. In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module. Based on the mechanisms of the Human Visual System (HVS) and syntax trees, the first two indicators identify perception and alignment deficiencies respectively, and the last module applies targeted quality enhancement accordingly. Extensive experiments show that, compared with alternative optimization methods, AIGIs refined by G-Refine perform better on more than 10 quality metrics across 4 databases. This improvement contributes to the practical application of contemporary T2I models, paving the way for their broader adoption. The code will be released at https://github.com/Q-Future/Q-Refine.
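
The idea of enhancing low-quality regions while leaving high-quality ones untouched can be illustrated with a minimal sketch. This is not the authors' implementation: G-Refine applies targeted denoising guided by its quality maps, whereas the sketch below simply blends a precomputed enhanced image into the original. The function name, the `threshold` parameter, and the linear weighting are all assumptions made for illustration.

```python
import numpy as np


def refine_with_quality_map(image, enhanced, quality_map, threshold=0.5):
    """Blend `enhanced` into `image` only where `quality_map` is low.

    image, enhanced: float arrays of shape (H, W, C)
    quality_map: float array of shape (H, W), where 1.0 means high quality
    threshold: hypothetical cutoff below which a pixel counts as defective
    """
    # Weight is 0 for pixels at or above the threshold and grows linearly
    # to 1 as the quality score drops to 0.
    weight = np.clip((threshold - quality_map) / threshold, 0.0, 1.0)
    weight = weight[..., None]  # broadcast over the channel axis
    return (1.0 - weight) * image + weight * enhanced


# Toy example: a 2x2 image whose pixels have decreasing quality scores.
image = np.zeros((2, 2, 3))
enhanced = np.ones((2, 2, 3))
quality_map = np.array([[1.0, 0.5], [0.25, 0.0]])
refined = refine_with_quality_map(image, enhanced, quality_map)
```

In this toy example, pixels scored at or above the threshold keep their original values, and the pixel with score 0.0 is fully replaced by the enhanced value. A soft weight rather than a hard mask is one way to avoid visible seams between refined and untouched regions.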

Paper Structure (13 sections, 10 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related Works
Proposed Method
  Perceptual Quality Indicator
  Alignment Quality Indicator
  Quality Refiner
Experiment
  Validation Databases
  Experiment Settings
  Quality Optimization Results
  Quality Assessment Results
  Ablation Study
Conclusion

Figures (5)

Figure 1: Framework of G-Refine, including a perceptual quality indicator module and an alignment quality indicator module. The refining process targets the unmatched prompts and both quality maps. Perceptual quality is improved by introducing more texture, while alignment quality is improved by adding the prompt content "blowing in the wind" to the image.
Figure 2: Training the CLIP model with overall perceptual quality and with images containing technical, rational, and natural defects. Both the image and text encoders are modified in terms of self-attention and token embedding.
Figure 3: Visualization results of the perceptual quality map, using the original CLIP or the improved image/text encoders. The original image encoder generates meaningless results, while the original text encoder produces reversed labels. Reasonable results are obtained only when both encoders are modified.
Figure 4: The mechanism for identifying alignment quality defects, including syntax tree construction, quality defect identification, and mask processing. Both unmatched nouns and unmatched adjectives can be enhanced in their correlated regions.
Figure 5: Radar maps for G-Refine on different original generative models.
