A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens
TL;DR
This work reframes steerability as a multi-dimensional goal-space alignment problem, highlighting that evaluating LLMs with scalar metrics masks miscalibration and side effects across several text attributes. It introduces a steerability framework that maps user goals and outputs to a shared vector space and decomposes steering errors into miscalibration and orthogonality, enabling uniform goal sampling via a steerability probe. Empirical results show that current LLMs exhibit pervasive side effects and goal-dimension entanglement, with prompt engineering offering limited gains, best-of-N sampling being costly, and RL fine-tuning providing partial progress toward steerability. The authors provide open-source tooling for steerability evaluation and argue that achieving robust steerability requires alignment approaches beyond inference-time tweaks, with RL-based strategies showing the most promise among those tested.
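To make the decomposition concrete, the following is a minimal sketch of one way a steering error could be split into a miscalibration term and an orthogonality term. It assumes that the source text, the user goal, and the model output are each already scored as a point in the same goal-space, that miscalibration is the error component along the requested direction, and that orthogonality is the component perpendicular to it. The function name `decompose_steering_error`, the returned keys, and the lack of any normalization are illustrative assumptions, not the framework's actual definitions or API.

```python
import math


def decompose_steering_error(source, goal, output):
    """Split the error (output - goal) relative to the requested move (goal - source).

    Assumption: all three arguments are equal-length sequences of attribute
    scores in a shared goal-space. This is an illustrative parallel/perpendicular
    split, not necessarily the exact metric used by the framework.
    """
    requested = [g - s for g, s in zip(goal, source)]
    error = [o - g for o, g in zip(output, goal)]

    requested_norm = math.sqrt(sum(r * r for r in requested))
    if requested_norm == 0.0:
        raise ValueError("goal equals source, so the requested direction is undefined")
    unit = [r / requested_norm for r in requested]

    # Signed error along the requested direction:
    # negative means the output undershot the goal, positive means it overshot.
    signed_parallel = sum(e * u for e, u in zip(error, unit))

    # Whatever remains after removing the parallel part is movement in
    # directions the user did not ask for (side effects).
    residual = [e - signed_parallel * u for e, u in zip(error, unit)]

    return {
        "steering_error": math.sqrt(sum(e * e for e in error)),
        "miscalibration": abs(signed_parallel),
        "orthogonality": math.sqrt(sum(x * x for x in residual)),
        "signed_parallel": signed_parallel,
    }


# Hypothetical two-attribute goal-space, e.g. (reading difficulty, formality).
example = decompose_steering_error(
    source=[0.0, 0.0],   # scores of the original text
    goal=[4.0, 0.0],     # user asks to change only the first attribute
    output=[3.0, 2.0],   # model undershoots and also shifts the second attribute
)
print(example)
```

Under these assumptions the two components are perpendicular, so the squared steering error equals the sum of the squared miscalibration and squared orthogonality. A single scalar error would not distinguish an output that falls short of the goal from one that reaches it while changing attributes the user never asked to change.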
Abstract
Despite advances in large language models (LLMs) on reasoning and instruction-following tasks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. Two gaps in current LLM evaluation impede steerability evaluation: (1) many benchmarks are built with past LLM chats and Internet-scraped text, which may skew towards common requests, and (2) scalar measures of performance common in prior work could conceal behavioral shifts in LLM outputs in open-ended generation. Thus, we introduce a framework based on a multi-dimensional goal-space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applying this framework to a text-rewriting task, we find that current LLMs induce unintended changes, or side effects, to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, vary in effectiveness, but side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
