What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan

TL;DR

This work introduces a principled framework to separate steerability from producibility in generative models and proposes a goal-induced benchmark to measure steerability in isolation. Through large-scale human studies on text-to-image models and LLMs, it reveals widespread poor steerability and highlights that even high-quality outputs do not guarantee effective user-directed steering. The authors develop a Blinder-like decomposition to disentangle improvements due to steering mechanisms from those due to producible output sets, showing both contribute substantially. Importantly, they demonstrate that steerability can be meaningfully improved with a simple image-based steering approach and an RL-learned steering distribution, achieving over 2× gains in steerability relative to text prompting. The findings underscore the need to prioritize steerability in model design and interaction interfaces to deliver goal-directed, user-satisfying performance in real-world use cases.

Abstract

How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical decomposition for quantifying steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in user studies of text-to-image and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. These results suggest that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: simple image-based steering mechanisms achieve more than 2x improvement on this benchmark.
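The benchmark idea in the abstract (sample a goal from the model, then ask a user to reproduce it) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generate`, `toy_user`, `jaccard`, and `run_trial` are hypothetical names, the "model" is a toy that turns a prompt into a set of words, and Jaccard overlap stands in for whatever similarity measure a real study would use.

```python
from typing import Callable, FrozenSet, List, Tuple

Output = FrozenSet[str]  # toy stand-in for an image or text sample
History = List[Tuple[str, Output]]


def toy_generate(prompt: str) -> Output:
    """Hypothetical toy model: the output is the set of words in the prompt."""
    return frozenset(prompt.lower().split())


def jaccard(a: Output, b: Output) -> float:
    """Assumed similarity measure (higher means more similar)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def toy_user(goal: Output, history: History) -> str:
    """Hypothetical user who describes one more element of the goal per attempt."""
    words = sorted(goal)
    return " ".join(words[: len(history) + 1])


def run_trial(
    generate: Callable[[str], Output],
    steer: Callable[[Output, History], str],
    similarity: Callable[[Output, Output], float],
    goal_prompt: str,
    num_attempts: int = 5,
) -> List[float]:
    """Run one reproduction trial and return the similarity score of each attempt."""
    # The goal is sampled from the same model used for steering,
    # so it is producible by construction.
    goal = generate(goal_prompt)
    history: History = []
    scores: List[float] = []
    for _ in range(num_attempts):
        prompt = steer(goal, history)      # the user sees the goal and past attempts
        output = generate(prompt)          # the model responds to the user's prompt
        scores.append(similarity(output, goal))
        history.append((prompt, output))
    return scores


scores = run_trial(toy_generate, toy_user, jaccard, "red book on wooden table")
print(scores)
```

Because the goal comes from the model itself, any gap between the attempts and the goal reflects how hard the model is to steer rather than what it can produce.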

Paper Structure

This paper contains 20 sections, 11 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: A goal image and attempts by users to steer toward it. To ensure the goal image is producible, it is sampled from the same model used for steering (here, DALL-E 3). Bolding added to prompts for emphasis. See \ref{app:fig:book_attempts_full_prompts} for the full prompts.
  • Figure 2: Human ratings (0-10 scale) measuring the similarity between human-steered text-to-image model outputs and their corresponding goal images. Left: For each model, the average similarity of all user-generated images to their goal images. Right: For each model, the average similarity of users' first and last attempted generated images to their goal images. Bars represent single standard errors.
  • Figure 3: Left: Steerability is poor for all models. Imp shows Improvement Rate and POM-1 and POM-5 show Prompt-Output Misalignment on users’ 1st and 5th attempts, respectively (standard errors in parentheses). Right: When a blind LLM is given the same number of attempts as humans to rewrite a prompt, it achieves more than half of the human improvement. Improvement is measured by the difference between a user's first and best DreamSim score with the goal image (bars represent single standard errors).
  • Figure 4: Left: Predictive performance using CLIP-based methods to predict steerability (standard errors in parentheses). Right: Each model's steerability improvement over DALL-E 2 broken into two components: one based on differences in the steering mechanism, the other based on differences in the producible set (\ref{eqn:blinder_decomposition}).
  • Figure 5: Rankings of the steerability of image generation models according to human ratings, DreamSim, and CLIP embedding cosine similarity.
  • ...and 15 more figures
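
The improvement measure in the Figure 3 caption (the difference between a user's first and best score against the goal image) and the single-standard-error bars can be computed as in the sketch below. It is illustrative only: the function names and the example scores are made up, and it assumes scores are oriented so that higher means closer to the goal, which may differ from how a given similarity tool reports its values.

```python
import math
from typing import List, Sequence, Tuple


def improvement(scores: Sequence[float]) -> float:
    """Best score across attempts minus the first-attempt score (never negative)."""
    if not scores:
        raise ValueError("need at least one attempt")
    return max(scores) - scores[0]


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and its standard error (sample std divided by sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for a standard error")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Made-up per-user score sequences, one inner list per user, in attempt order.
per_user_scores: List[List[float]] = [
    [0.30, 0.45, 0.40],
    [0.50, 0.48, 0.52],
    [0.20, 0.20, 0.20],
]

improvements = [improvement(s) for s in per_user_scores]
mean_imp, se_imp = mean_and_standard_error(improvements)
print(f"mean improvement = {mean_imp:.3f} (SE {se_imp:.3f})")
```

Using the best attempt rather than the last one means a user is not penalized for a late attempt that drifts away from the goal.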