Discriminative Probing and Tuning for Text-to-Image Generation

Leigang Qu, Wenjie Wang, Yongqi Li, Hanwang Zhang, Liqiang Nie, Tat-Seng Chua

TL;DR

A discriminative adapter built on T2I models is presented to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment.

Abstract

Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.
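The self-correction idea mentioned in the abstract, using gradients of a discriminative score to steer generation at inference time, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the bilinear `matching_score`, the placeholder `toy_denoise` step, the `strength` parameter, and the vector sizes are all hypothetical stand-ins for the real adapter and diffusion model.

```python
import numpy as np

def matching_score(latent, text_emb, proj):
    # Toy bilinear text-image matching score: text_emb^T (proj @ latent).
    # A real discriminative adapter would be a learned network (assumption).
    return float(text_emb @ (proj @ latent))

def toy_denoise(latent):
    # Placeholder for one denoising step of a real T2I model (assumption).
    return 0.9 * latent

def generate(latent, text_emb, proj, strength, steps=10):
    z = latent.copy()
    # For the bilinear score, the gradient w.r.t. the latent is constant:
    # d/dz [text_emb^T (proj @ z)] = proj^T @ text_emb.
    grad = proj.T @ text_emb
    for _ in range(steps):
        z = toy_denoise(z)
        # Self-correction: move the latent along the gradient of the score.
        z = z + strength * grad
    return z

rng = np.random.default_rng(0)
latent = rng.normal(size=8)       # hypothetical image latent
text_emb = rng.normal(size=4)     # hypothetical text embedding
proj = rng.normal(size=(4, 8))    # hypothetical projection into text space

plain = generate(latent, text_emb, proj, strength=0.0)
guided = generate(latent, text_emb, proj, strength=0.1)

print("score without self-correction:", matching_score(plain, text_emb, proj))
print("score with self-correction:   ", matching_score(guided, text_emb, proj))
```

With `strength=0.0` the loop reduces to plain denoising. Any positive strength adds a multiple of the score gradient at every step, so the guided latent ends with a higher matching score than the unguided one in this toy setting.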

Paper Structure (30 sections, 9 equations, 15 figures, 11 tables)

Figures (15)

  • Figure 1: Illustration of (a) the text-image misalignment problem and (b) our motivation: enhancing the discriminative abilities of T2I models to promote their generative abilities. Panel (a) shows three incorrect results generated by SD-v2.1 [rombach2022high], exhibiting attribute binding errors, counting errors, and relation confusion.
  • Figure 2: Schematic illustration of the proposed discriminative probing and tuning (DPT) framework. We first extract semantic representations from the frozen SD and then propose a discriminative adapter to conduct discriminative probing to investigate the global matching and local grounding abilities of SD. Afterward, we perform parameter-efficient discriminative tuning by introducing LoRA parameters. During inference, we present the self-correction mechanism to guide the denoising-based text-to-image generation.
  • Figure 3: Generative and discriminative results by probing different layers of U-Net in SD-v2.1 and adapting to ITM and REC. We report average CLIP and BLIP-M scores over COCO-NSS1K and CC-500, overall matching performance on MSCOCO-HN, and average grounding performance over all test sets of RefCOCO, RefCOCO+, and RefCOCOg. We conduct model selection based on T2I performance on the validation set of COCO-NSS1K.
  • Figure 4: (a) Variation of generation and discrimination performance as tuning progresses and (b) impact of the self-correction strength on T2I performance on CC-500.
  • Figure 5: Qualitative results on COCO-NSS1K. We compare DPT with SD-v2.1 and two baselines, Attend-and-Excite (AaE) [chefer2023attend] and HN-DiffusionITM (HN-DiffITM) [krojer2023diffusion], regarding object appearance, counting, spatial relation, semantic relation, and compositional reasoning. Categories and the corresponding keywords in prompts are highlighted with different colors.
  • ...and 10 more figures