Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation

Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham

TL;DR

This paper investigates the effectiveness of fine-grained feedback, which captures nuanced distinctions in image quality and prompt alignment, compared to traditional coarse-grained feedback, and identifies key challenges in eliciting and utilizing fine-grained feedback.

Abstract

Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback, which captures nuanced distinctions in image quality and prompt alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget in some settings, whereas in controlled settings with known attributes fine-grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes, and may require novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in the wild.
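
At a high level, the comparison is between a reward model trained directly on coarse scalar preferences and a concept-bottleneck-style model that first predicts fine-grained attribute scores and then aggregates them into a reward (see Figure 2). Below is a minimal sketch of the two model shapes, assuming PyTorch; the embedding size, attribute count, and class names are hypothetical, and the sketch illustrates the bottleneck structure rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512      # assumed size of a frozen image-text embedding
NUM_ATTRIBUTES = 8   # hypothetical number of fine-grained attributes

class CoarseRewardModel(nn.Module):
    """Maps an image-text embedding directly to a scalar reward,
    trained on coarse labels (thumbs up/down or pairwise preferences)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(EMBED_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, emb):
        return self.head(emb).squeeze(-1)

class FineGrainedRewardCBM(nn.Module):
    """Concept-bottleneck variant: first predict per-attribute scores
    (e.g., blurry, malformed, text-aligned) supervised by fine-grained
    feedback, then aggregate the attributes into a scalar reward."""
    def __init__(self):
        super().__init__()
        self.attribute_head = nn.Linear(EMBED_DIM, NUM_ATTRIBUTES)
        self.aggregator = nn.Linear(NUM_ATTRIBUTES, 1)

    def forward(self, emb):
        attrs = torch.sigmoid(self.attribute_head(emb))  # the bottleneck
        return self.aggregator(attrs).squeeze(-1), attrs
```

In a setup like this, the attribute head is supervised with the per-attribute labels while the coarse model sees only the overall judgment, which is why the two feedback types trade off differently under a fixed annotation budget.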

Paper Structure

This paper contains 31 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Example text-image pair where granular feedback matters.
  • Figure 2: Top: a typical coarse-grained feedback reward pipeline. Bottom: the proposed method for modeling fine-grained feedback.
  • Figure 3: Comparing reward models trained on coarse feedback (i.e., direct human preference judgments; black) against concept bottleneck model (CBM)-based models learned from fine-grained feedback. Reward models are differentiated by whether they were trained on granular feedback only about image quality (blue), image-text alignment (red), or both (purple). Left: Each point represents a reward model trained on $N$ image-prompt examples (x-axis); the ROC-AUC of the binary reward against held-out human preference judgments is shown on the y-axis. Higher is better. Right: The same reward models, where the x-axis (on a log scale) depicts estimated annotation cost, assuming each attribute is equally costly to procure. (A minimal sketch of this ROC-AUC evaluation appears after the figure list.)
  • Figure 4: Comparing reward models trained on varying levels of granularity. As in Figure 3, each point represents a reward model trained on $N$ images. Models are scored according to the contrived decision tree on held-out examples. We compare a model trained directly on the single scalar decision-tree scores (black) against a suite of CBM-based fine-grained models trained on: 1) the same three attributes that make up the decision tree (red), 2) those three attributes along with the remainder of the full set of image attributes under consideration (blue), and 3) only attributes not included in the decision tree (orange). (A hypothetical stand-in for such a decision tree appears after the figure list.)
  • Figure 5: Estimated similarity between PaLI scores for different attributes. We depict the proportion of images in the training set for which PaLI assigns the same label to a pair of attributes (e.g., the (blurry, malformed) cell indicates that PaLI marks an image as both blurry and malformed, or as neither, for 71% of the examples). Darker red means a higher level of similarity in scores; yellow represents lower similarity. (A sketch of this agreement computation appears after the figure list.)
  • ...and 4 more figures
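
Figure 3 scores each reward model by the ROC-AUC of its binary reward against held-out human preference judgments. A minimal sketch of that evaluation step, assuming scikit-learn and using synthetic stand-ins for the hypothetical arrays `reward_scores` and `human_labels`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical held-out data: one reward score per image-prompt pair,
# and a binary human preference judgment (1 = preferred, 0 = not).
rng = np.random.default_rng(0)
reward_scores = rng.normal(size=200)
human_labels = (reward_scores + rng.normal(scale=1.5, size=200) > 0).astype(int)

# ROC-AUC of the continuous reward against the binary judgments;
# higher is better, 0.5 is chance.
auc = roc_auc_score(human_labels, reward_scores)
print(f"held-out ROC-AUC: {auc:.3f}")
```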
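
Figure 4's controlled experiment scores images with a contrived decision tree over three known attributes. The paper's figure list does not reproduce the tree, so the sketch below is a hypothetical stand-in (`blurry` and `malformed` come from Figure 5's attribute set; `text_aligned` is assumed) showing how a scalar ground-truth score can be derived from binary attributes:

```python
def decision_tree_score(blurry: bool, malformed: bool, text_aligned: bool) -> float:
    """Hypothetical stand-in for the paper's contrived decision tree:
    a scalar ground-truth reward computed from three known attributes."""
    if not text_aligned:
        return 0.0                   # misaligned images score lowest
    if malformed:
        return 0.3
    return 0.7 if blurry else 1.0    # sharp, well-formed, aligned is best

# A coarse model trains on the scalar score alone; the fine-grained CBM
# also sees the attribute labels (plus distractors, in one variant).
print(decision_tree_score(blurry=False, malformed=False, text_aligned=True))  # 1.0
```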
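
Figure 5's similarity matrix is the proportion of training images for which PaLI assigns the same binary label to a pair of attributes. A sketch of that computation, assuming a hypothetical 0/1 array `pali_labels` of shape (num_images, num_attributes) standing in for PaLI's outputs:

```python
import numpy as np

attributes = ["blurry", "malformed", "cropped"]  # hypothetical subset
rng = np.random.default_rng(1)
pali_labels = rng.integers(0, 2, size=(1000, len(attributes)))  # stand-in for PaLI

# agreement[i, j] = proportion of images where attributes i and j
# receive the same binary label (both present or both absent).
agreement = (pali_labels[:, :, None] == pali_labels[:, None, :]).mean(axis=0)
print(agreement)
```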