Rich Human Feedback for Text-to-Image Generation

Youwei Liang; Junfeng He; Gang Li; Peizhao Li; Arseniy Klimovskiy; Nicholas Carolan; Jiao Sun; Jordi Pont-Tuset; Sarah Young; Feng Yang; Junjie Ke; Krishnamurthy Dj Dvijotham; Katie Collins; Yiwen Luo; Yang Li; Kai J Kohlhoff; Deepak Ramachandran; Vidhya Navalpakkam

TL;DR

This work addresses pervasive artifact, misalignment, and aesthetic problems in text-to-image (T2I) generation by introducing RichHF-18K, a rich human feedback dataset with region-level artifact/misalignment heatmaps and misaligned keywords. The authors then train a multimodal transformer, Rich Automatic Human Feedback (RAHF), to predict these rich annotations, which provides actionable guidance for training-data selection and region inpainting. They show that the predicted feedback improves generation in two ways: by finetuning models such as Muse on selected data, and by inpainting the regions flagged as problematic. The improvements carry over to Muse, a model outside the Stable Diffusion variants used to generate the annotated images. The dataset and RAHF predictions offer interpretable, fine-grained feedback that can support more reliable and controllable T2I generation in practice.

Abstract

Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented in or missing from the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K dataset will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
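The two uses of predicted feedback named in the abstract (data selection and heatmap-based inpainting masks) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the assumption that heatmaps and scores are normalized to [0, 1], and the threshold values are all hypothetical choices made for illustration.

```python
import numpy as np


def heatmap_to_mask(heatmap, threshold=0.5):
    """Binarize a predicted heatmap into an inpainting mask.

    Assumes (hypothetically) that heatmap values lie in [0, 1].
    True marks pixels to be regenerated by an inpainting model.
    """
    heatmap = np.asarray(heatmap, dtype=np.float32)
    return heatmap >= threshold


def select_by_score(samples, scores, min_score):
    """Keep only the samples whose predicted score reaches min_score."""
    return [s for s, sc in zip(samples, scores) if sc >= min_score]


if __name__ == "__main__":
    # Toy 2x2 "heatmap"; a real one would match the image resolution.
    predicted = np.array([[0.1, 0.7], [0.5, 0.2]])
    print(heatmap_to_mask(predicted))

    # Toy finetuning pool filtered by a hypothetical score cutoff.
    print(select_by_score(["a", "b", "c"], [0.2, 0.9, 0.6], min_score=0.6))
```

In practice the cutoff for both steps would need tuning: a low mask threshold regenerates more of the image than necessary, while a high one may leave artifacts untouched.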

Paper Structure (43 sections, 16 figures, 5 tables)

Introduction
Related works
Text-to-image generation
Text-to-image evaluation and reward models
Collecting rich human feedback
Data collection process
Human feedback consolidation
RichHF-18K: a dataset of rich human feedback
Data statistics of RichHF-18K
Predicting rich human feedback
Models
Architecture
Model variants
Multi-head
Augmented prompt
...and 28 more sections

Figures (16)

Figure 1: An illustration of our annotation UI. Annotators mark points on the image to indicate artifact/implausibility regions (red points) or regions misaligned with the text prompt (blue points). Then, they click on the words to mark the misaligned keywords (underlined and shaded) and choose the scores for plausibility, text-image alignment, aesthetics, and overall quality (underlined).
Figure 2: Histograms of the average scores of image-text pairs in the training set.
Figure 3: Architecture of our rich feedback model. Our model consists of two streams of computation: a vision stream and a text stream. We perform self-attention over the image tokens output by the ViT and the text tokens output by the Text-embed module to fuse the image and text information. The vision tokens are reshaped into feature maps and mapped to heatmaps and scores. The vision and text tokens are sent to a Transformer decoder to generate a text sequence.
Figure 4: Counts of the samples with maximum differences of the scores in the training set.
Figure 5: Examples of implausibility heatmaps. Prompt: photo of a slim asian little girl ballerina with long hair wearing white tights running on a beach from behind nikon D5
...and 11 more figures
