Preference Adaptive and Sequential Text-to-Image Generation

Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, Craig Boutilier

TL;DR

The Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent.

Abstract

We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent that iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.

Paper Structure

This paper contains 45 sections, 1 theorem, 30 equations, 21 figures, 3 tables, 1 algorithm.

Key Result

Proposition 3.4

[dorfman2021offline] Consider data collected according to the setting of the user-ambiguity definition (def:user_ambiguity). The data is identifiable if, for every pair of distinct user types $i \neq j$, there exists an identifying slate of images that overlaps.

Figures (21)

  • Figure 1: An illustration of an agent-user interaction with $L=4$ prompt expansions at each step, and $M=1$ images per prompt expansion. The user selection is outlined in blue. The agent presents prompt expansions based on the user's previous responses to maximize the expected cumulative user satisfaction (i.e., value). See the appendix (appdx:pasta_examples) for additional examples using $M=4$ images for $L=4$ prompt expansions.
  • Figure 2: Top. PASTA policy framework: The LMM generates a candidate set, from which the candidate selector policy selects a slate. Bottom. Each prompt in the slate is evaluated individually using a prompt-value model, and the overall slate value is calculated as the average of the individual prompt values.
  • Figure 3: Score function architecture. The image and prompt are fed into CLIP encoders followed by user encoders to generate $K$ user embeddings. The score for each user type is the inner product between the corresponding user image embedding and user text embedding.
  • Figure 4: The graphs present the performance of a trained user model as a function of the number of user types considered. The top row displays the model's accuracy on the Pick-a-Pic test set (left) and its Spearman's rank correlation on the HPS test set (right). The bottom row shows the model's choice accuracy (left) and cross-turn preference accuracy (right), both evaluated on our human-rated test set.
  • Figure 5: Emergence of user-specific preferences: Each row displays the top five images---scored by the user model, prior to fine-tuning---from the HPS test set for one of five specific user types. We highlight user types where differences across the top five images were especially salient. The category labels (Animals, Food, etc.) are simply meant to be evocative of the style or content of the most preferred images.
  • ...and 16 more figures
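
The captions for Figures 2 and 3 describe two simple computations: a per-user-type score formed as the inner product of a user image embedding and a user text embedding, and a slate value formed as the average of individual prompt values. The following is a minimal sketch of just those two steps, not the authors' implementation. The function names (`user_type_scores`, `slate_value`) and the toy numbers are hypothetical, and the CLIP encoders, user encoders, and prompt-value model are assumed to have already produced their outputs as plain lists of floats.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def user_type_scores(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
) -> List[float]:
    """Score one (image, prompt) pair for each of K user types.

    image_embs[k] and text_embs[k] stand in for the user image embedding
    and user text embedding of user type k (hypothetical inputs; in the
    paper these come from CLIP encoders followed by user encoders).
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("need one image and one text embedding per user type")
    return [dot(img, txt) for img, txt in zip(image_embs, text_embs)]


def slate_value(prompt_values: Sequence[float]) -> float:
    """Slate value as the average of the individual prompt values."""
    if not prompt_values:
        raise ValueError("slate must contain at least one prompt")
    return sum(prompt_values) / len(prompt_values)


if __name__ == "__main__":
    # Toy example with K = 2 user types and 2-dimensional embeddings.
    scores = user_type_scores(
        image_embs=[[1.0, 0.0], [0.5, 0.5]],
        text_embs=[[2.0, 1.0], [1.0, 3.0]],
    )
    print(scores)
    # Toy slate of L = 4 prompt expansions with made-up prompt values.
    print(slate_value([1.0, 2.0, 3.0, 4.0]))
```

Averaging keeps the slate value on the same scale as a single prompt value regardless of slate size. How the prompt values themselves are learned is outside the scope of this sketch.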

Theorems & Definitions (5)

  • Definition 1.1
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Proposition 3.4