A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens

TL;DR

This work reframes steerability as a multi-dimensional goal-space alignment problem, highlighting that evaluating LLMs with scalar metrics masks miscalibration and side effects across several text attributes. It introduces a steerability framework that maps user goals and outputs to a shared vector space and decomposes steering errors into miscalibration and orthogonality, enabling uniform goal sampling via a steerability probe. Empirical results show that current LLMs exhibit pervasive side effects and goal-dimension entanglement, with prompt engineering offering limited gains, best-of-$N$ sampling being costly, and RL fine-tuning providing partial progress toward steerability. The authors provide open-source tooling for steerability evaluation and argue that achieving robust steerability requires alignment approaches beyond inference-time tweaks, with RL-based strategies showing the most promise among those tested.

Abstract

Despite advances in large language models (LLMs) on reasoning and instruction-following tasks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. Two gaps in current LLM evaluation impede steerability evaluation: (1) many benchmarks are built with past LLM chats and Internet-scraped text, which may skew towards common requests, and (2) scalar measures of performance common in prior work could conceal behavioral shifts in LLM outputs in open-ended generation. Thus, we introduce a framework based on a multi-dimensional goal-space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs induce unintended changes or side effects to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness but side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.

Paper Structure

This paper contains 95 sections, 16 equations, 28 figures, 12 tables.

Figures (28)

  • Figure 1: Steerability metrics in 2D goal-space (reading level & text length). A user aims to rewrite text according to some intent, expressed via a prompt (Make this harder to read...). The steering error (red dotted line) is the gap between the user's intent (blue) and the LLM's output (red). Miscalibration (miscal.) and orthogonality (ortho.) decompose steering error into components parallel and orthogonal to user intent respectively.
  • Figure 2: Median (IQR) of steering error (left), miscalibration (middle), and orthogonality (right), Llama3 family. Caps denote empirical 95% CI with outliers ($\circ$) plotted individually. Steering error does not improve with model size (left), but miscalibration does (middle). Orthogonality drops slightly (right), but remains skewed away from 0.
  • Figure 3: Vector flow of goal-space movement (blue), Llama3.3-70B, in requests to change reading difficulty but not formality. Horizontal movement is desired, but not vertical movement. Source texts in red.
  • Figure 4: Median and IQR steerability, Llama3.3-70B, in correlated (darker) vs. anti-correlated (lighter) requests for change in reading difficulty and formality. Caps denote empirical 95% CI with outliers ($\circ$) plotted individually. Llama3.3-70B struggles more with anti-correlated changes.
  • Figure 5: Median and IQR of steering error (left), miscalibration (middle), and orthogonality (right) of Llama3.1-8B across prompting strategies. Caps denote empirical 95% CI with outliers ($\circ$) plotted individually. More detailed prompts and removal of the negative prompt marginally improve miscalibration over the default. However, side effects remain severe.
  • ...and 23 more figures
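
The geometric decomposition described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `decompose_steering_error`, the dictionary keys, and the example coordinates are hypothetical, and the paper's exact normalization of miscalibration and orthogonality is not stated in this excerpt. The sketch only assumes that user intent is the vector from the source text to the goal, and that steering error is the vector from the goal to the model output, both in a shared goal-space.

```python
import math


def decompose_steering_error(source, goal, output):
    """Split the steering error into parts parallel and orthogonal to user intent.

    All three arguments are points in goal-space, one coordinate per text
    attribute (e.g., reading level, text length). Names and scaling here are
    illustrative assumptions, not the paper's definitions.
    """
    intent = [g - s for g, s in zip(goal, source)]   # where the user wants to move
    error = [o - g for o, g in zip(output, goal)]    # gap between output and goal

    intent_norm = math.sqrt(sum(v * v for v in intent))
    if intent_norm == 0:
        raise ValueError("goal equals source, so the intent direction is undefined")
    unit = [v / intent_norm for v in intent]

    # Signed length of the error along the intent direction:
    # negative means undershoot, positive means overshoot.
    signed_parallel = sum(e * u for e, u in zip(error, unit))
    parallel = [signed_parallel * u for u in unit]
    orthogonal = [e - p for e, p in zip(error, parallel)]

    return {
        "steering_error": math.sqrt(sum(e * e for e in error)),
        "signed_parallel": signed_parallel,
        "parallel_magnitude": abs(signed_parallel),
        "orthogonal_magnitude": math.sqrt(sum(v * v for v in orthogonal)),
    }


# Hypothetical 2D example: axes are (reading level, text length).
# The user asks for harder text at the same length; the output is
# somewhat harder but also longer.
parts = decompose_steering_error(
    source=(0.5, 0.5), goal=(0.8, 0.5), output=(0.7, 0.7)
)
print(parts)
```

In this example the error along the intent direction is an undershoot in reading level, while the unrequested change in text length appears entirely in the orthogonal component, which is the kind of side effect a single scalar score would hide. Because the two components are perpendicular, their squared magnitudes sum to the squared steering error.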