EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap
TL;DR
EVALUESTEER introduces a large-scale synthetic benchmark for evaluating how well reward models (RMs) and LLMs can be steered toward joint user value and style profiles. By grounding value profiles in the World Values Survey and defining four orthogonal style dimensions, the framework generates 165,888 pairwise preferences across 24 prompts and 288 user profiles, enabling controlled in-context steerability testing under 11 prompting settings. Across six models, results show that context improves alignment but models remain ~25 percentage points below an oracle, with consistent biases toward secular values and verbose style, and a tendency to favor style over values when the two conflict. The work highlights critical limitations of current reward models for pluralistic alignment and provides a challenging testbed for developing value- and style-aware steering in AI systems.
Abstract
As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs' and reward models' (RMs) steerability towards users' value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets, which do not support controlled evaluations of RM steering, we synthetically generate 165,888 preference pairs, systematically varying them along four value dimensions (traditional, secular-rational, survival, and self-expression) and four style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user's preferences. We evaluate six open-source and proprietary LLMs and RMs under eleven systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user's full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only the relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs in identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.
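
The evaluation described above reduces to a pairwise accuracy: given a user profile in context and two candidate responses, does the model prefer the response aligned with that profile? The following is a minimal sketch of that metric, not the authors' implementation. The names `PreferencePair`, `pairwise_accuracy`, and `toy_scorer`, the field layout, and the rule that ties count as incorrect are all assumptions made for illustration; a real run would replace `toy_scorer` with a call to an RM or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One benchmark item (field names are illustrative assumptions)."""
    profile: str   # user value/style profile supplied in context
    prompt: str
    chosen: str    # response aligned with the profile
    rejected: str  # response misaligned with the profile


# Hypothetical scorer signature: (profile, prompt, response) -> scalar reward.
Scorer = Callable[[str, str, str], float]


def pairwise_accuracy(pairs: Sequence[PreferencePair], score: Scorer) -> float:
    """Fraction of pairs where the aligned response scores strictly higher.

    Ties are counted as incorrect (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        score(p.profile, p.prompt, p.chosen) > score(p.profile, p.prompt, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)


def toy_scorer(profile: str, prompt: str, response: str) -> float:
    """Stand-in for a real RM: counts response words that appear in the profile."""
    profile_words = set(profile.lower().split())
    return float(sum(w in profile_words for w in response.lower().split()))


# Two made-up items, purely to exercise the metric.
pairs = [
    PreferencePair("prefers concise answers", "Explain tides.",
                   "A concise answer: the moon's gravity.",
                   "A very long winded essay."),
    PreferencePair("prefers warm tone", "Explain tides.",
                   "Happy to help, in a warm tone!",
                   "Tides are caused by gravity."),
]
accuracy = pairwise_accuracy(pairs, toy_scorer)
```

Comparing this accuracy across prompting conditions (for example, full profile versus only the relevant preferences) is what exposes the gap the abstract reports; the sketch only fixes the metric, not how the profile is rendered into the prompt.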
