Evaluating the Prompt Steerability of Large Language Models

Erik Miehling, Michael Desmond, Karthikeyan Natesan Ramamurthy, Elizabeth M. Daly, Pierre Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu

TL;DR

The paper addresses a prerequisite for pluralistic AI: measuring how readily a language model's persona can be steered through prompting. It introduces a prompt-based steerability benchmark that defines evaluation profiles and steerability indices, normalizes shifts against a model's baseline behavior using Wasserstein distances, and uses a persona dataset to measure steering along multiple persona dimensions and directions. Experiments on six models reveal notable baseline skew and directional asymmetry, with larger, more capable models showing higher yet bounded steerability. The work provides an open-source benchmark and a measurement framework for pluralistic AI design, and it outlines future work on multi-turn prompts and stronger links to in-context learning.

Abstract

Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
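To make the idea of "shifting a behavioral distribution from its baseline" concrete, here is a minimal Python sketch. It is not the paper's definition of the steerability indices: the function `steer_index` and its normalization (the fraction of the baseline-to-target Wasserstein distance that steering closes) are assumptions chosen for illustration, and behavior is reduced to one-dimensional scores in [0, 1].

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size samples on the real line, the optimal transport plan
    matches sorted values pairwise, so the distance is the mean absolute
    difference of the sorted samples.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def steer_index(base: Sequence[float], steered: Sequence[float], target: float) -> float:
    """Hypothetical normalized index (an assumption, not the paper's formula).

    Returns the fraction of the baseline's distance to a point mass at
    `target` that the steered sample has closed. 0 means no movement,
    1 means the steered behavior sits entirely at the target.
    """
    d_base = wasserstein_1d(base, [target] * len(base))
    d_steered = wasserstein_1d(steered, [target] * len(steered))
    if d_base == 0.0:
        return 0.0  # the baseline is already at the target; nothing to close
    return (d_base - d_steered) / d_base


# Illustrative scores in [0, 1]; these numbers are made up, not results.
base_scores = [0.25, 0.5, 0.5, 0.75]
steered_up = [0.5, 0.75, 0.75, 1.0]      # steered toward the positive pole
steered_down = [0.25, 0.25, 0.5, 0.5]    # steered toward the negative pole

gamma_plus = steer_index(base_scores, steered_up, target=1.0)
gamma_minus = steer_index(base_scores, steered_down, target=0.0)
```

Normalizing by the baseline's distance to the target is one way to keep a model with a skewed baseline from looking more or less steerable simply because it starts close to, or far from, one pole. Comparing `gamma_plus` with `gamma_minus` gives a simple view of directional asymmetry.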

Paper Structure

This paper contains 16 sections, 8 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Models are steered along each dimension (e.g., conscientiousness as shown above) by including $k$ steering examples for the direction of interest in the model's system prompt. Profiling prompts (for the same dimension) take the form of polar (yes/no) questions.
  • Figure 2: An illustration of how the steerability indices are computed from base and steered profiles. The base distribution $p_i$ is in blue with the positively and negatively steered distributions
  • Figure 3: Base profiles (as beta distributions) for six models (two models from each of three providers: Meta, IBM, and Microsoft) across persona dimensions openness, extraversion, psychopathy, and narcissism. Profiles were obtained using $n_\text{prf}=25$ profiling questions (in each direction) across $T_e=5$ experiment trials. Plots illustrate the weighted averages of beta distributions across experiment trials.
  • Figure 4: Steerability curves, given by the steerability indices $(\gamma_{i,k}^+, \gamma_{i,k}^-)$ plotted over steering budget $k$, for the six models on the dimension $i = \text{ends-justify-means}$.
  • Figure 5: The 32 persona dimensions we study in our persona steerability benchmark. The listed dimensions are the subset of the 133 dimensions from the anthropic-evals dataset that contain at least 300 examples (in each direction) with at least 0.85 label confidence. Dimensions are grouped into the eight categories from Perez et al. (2022).
  • ...and 8 more figures
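
The captions above describe profiles as beta distributions built from answers to polar (yes/no) profiling questions and averaged across experiment trials. The following Python sketch shows one way such a profile could be formed. The uniform Beta(1, 1) prior, the class name `BetaProfile`, the helper names, and the equal trial weights are all assumptions for illustration; the captions do not specify these details.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BetaProfile:
    """A Beta(alpha, beta) distribution summarizing agreement with a persona direction."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1.0))


def profile_from_answers(
    answers: Iterable[bool],
    prior: Tuple[float, float] = (1.0, 1.0),  # assumed uniform prior
) -> BetaProfile:
    """Build a profile from polar answers.

    Each element is True when the model's yes/no answer is consistent with
    the positive direction of the dimension, and False otherwise.
    """
    answers = list(answers)
    agree = sum(1 for a in answers if a)
    disagree = len(answers) - agree
    return BetaProfile(prior[0] + agree, prior[1] + disagree)


def mixture_mean(
    profiles: Sequence[BetaProfile],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Mean of a weighted mixture of per-trial profiles (equal weights assumed by default)."""
    if not profiles:
        raise ValueError("need at least one profile")
    if weights is None:
        weights = [1.0] * len(profiles)
    return sum(w * p.mean for w, p in zip(weights, profiles)) / sum(weights)


# Two illustrative trials with made-up answers (not data from the paper).
trial_a = profile_from_answers([True] * 20 + [False] * 5)
trial_b = profile_from_answers([True] * 15 + [False] * 10)
pooled_mean = mixture_mean([trial_a, trial_b])
```

A profile whose mean sits far from 0.5 before any steering examples are added would correspond to the kind of baseline skew the benchmark reports.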