EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences

Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap

TL;DR

EVALUESTEER introduces a large-scale, synthetic benchmark to evaluate reward-model and LLM steerability toward joint user value and style profiles. By grounding value profiles in the World Values Survey and defining four orthogonal style dimensions, the framework generates 165,888 pairwise preferences across 24 prompts and 288 user profiles, enabling controlled in-context steerability testing under 11 prompting settings. Across six models, results show that context improves alignment but that models remain ~25 percentage points below an oracle, with consistent secular-value and verbose-style biases and a tendency to favor style over values when the two conflict. The work highlights critical limitations of current reward models for pluralistic alignment and provides a challenging testbed for developing value- and style-aware steering in AI systems.

Abstract

As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs' and reward models' (RMs) steerability towards users' value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs -- systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user's preferences. We evaluate six open-source and proprietary LLMs and RMs under eleven systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user's full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.
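The evaluation described above reduces to a pairwise choice: given a prompt, a user profile, and two candidate responses, does the model prefer the profile-aligned one? The following is a minimal sketch of that accuracy computation, not the authors' implementation. The `PreferencePair` fields, the `Scorer` signature, and `length_biased_scorer` (a toy stand-in for a reward model) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One evaluation instance (field names are hypothetical)."""
    prompt: str
    profile: str      # user value/style context; empty string means no context
    aligned: str      # response that matches the user's profile
    misaligned: str   # response that conflicts with the user's profile


# Assumed interface: a scorer maps (prompt, profile, response) to a scalar reward.
Scorer = Callable[[str, str, str], float]


def pairwise_accuracy(pairs: Sequence[PreferencePair], score: Scorer) -> float:
    """Fraction of pairs where the aligned response gets the strictly higher score."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        score(p.prompt, p.profile, p.aligned) > score(p.prompt, p.profile, p.misaligned)
        for p in pairs
    )
    return correct / len(pairs)


def length_biased_scorer(prompt: str, profile: str, response: str) -> float:
    """Toy stand-in for a reward model: ignores the profile and rewards length."""
    return float(len(response))


# Illustrative data only; these are not items from the benchmark.
demo_pairs = [
    PreferencePair(
        prompt="Do you consider family important?",
        profile="prioritizes family; prefers concise answers",
        aligned="Yes, family is central.",
        misaligned="Family is only one of many social ties, and its weight varies widely.",
    ),
    PreferencePair(
        prompt="Do you consider family important?",
        profile="prioritizes family; prefers verbose answers",
        aligned="Yes, family is central, and it shapes most of the decisions people make.",
        misaligned="Not really.",
    ),
]

print(pairwise_accuracy(demo_pairs, length_biased_scorer))
```

A scorer that ignores the profile, like the toy one here, can only be right when the user's preference happens to coincide with its built-in bias, which is the failure mode the benchmark is designed to expose.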

Paper Structure

This paper contains 44 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: EVALUESTEER workflow. The figure illustrates a single evaluation instance. (1) Prompt. A value-laden prompt from PRISM (Kirk et al., 2024) (e.g. "Do you consider family important?") is posed to the system. (2) Candidate responses. Two completions are compared: one that is value-aligned with the user profile and one that is value-misaligned; style alignment can vary independently, allowing us to cross value $\times$ style factors. (3) User profile context. The reward model is supplied with a structured summary of the user’s value preferences (e.g. prioritizes family) and style preferences (e.g. favors warm tone). (4) Scoring and selection. Using the prompt, profile, and candidate responses, the reward model selects the response it believes the user prefers.
  • Figure 2: Performance improvements from supplying user context. Bars show the mean pairwise accuracy (% of preference pairs correctly ranked; higher is better) achieved by the reward model (RM) under five conditions: No Context, Value-only context, Style-only context, and Value + Style context with different priority orders. Bars with cross-overs indicate CoT results for LLM-as-a-judge RMs. A dashed horizontal line marks random performance at 0.5. Takeaway: Conditioning on values or styles improves performance, but even the best setting remains over 25 percentage points below oracle levels, highlighting the limited steerability of current reward models to pluralistic preferences.
  • Figure 3: Intrinsic value bias of RMs. Plots on the left show the proportion of times each RM chooses a response aligned with a value related to an Inglehart-Welzel value-loading question. Proportions in blue represent Secular/Self-expression responses. Proportions in red represent Traditional/Survival responses. The scatter plot on the right places each RM in one of the 4 cultural value quadrants defined by Inglehart et al. (2014).
  • Figure 4: Intrinsic style bias of RMs. Bars show the proportion of times each style (verbose, high confidence, warm, high reading difficulty) is chosen over its opposite (concise, low confidence, cold, low reading difficulty). A proportion is significantly different from 0.5 when its error bars (95% CI) do not overlap the dashed reference line. RMs exhibit significant biases toward verbosity and high reading difficulty.
  • Figure 5: Value vs. style steering preference in a neutral setting. Bars indicate the proportion of times RMs prioritize value (blue) vs. style (red). A proportion is significantly different from 0.5 when its error bars (95% CI) do not overlap the dashed reference line. A consistent style-over-value bias persists across RMs.
  • ...and 4 more figures
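
The captions for Figures 4 and 5 treat a proportion as significantly different from 0.5 when its 95% confidence interval does not overlap the 0.5 reference line. The sketch below shows one way to run that check. The source does not say how the intervals were computed, so the normal-approximation (Wald) interval used here is an assumption, and the function names and the counts in the usage lines are hypothetical, not reported results.

```python
import math
from typing import Tuple


def proportion_ci(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (proportion, lower, upper) for a Wald interval; z=1.96 gives ~95%."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    # Clip to [0, 1] because the Wald interval can overshoot near the boundaries.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def differs_from_chance(successes: int, trials: int, chance: float = 0.5) -> bool:
    """True when the interval lies entirely above or entirely below `chance`."""
    _, lower, upper = proportion_ci(successes, trials)
    return not (lower <= chance <= upper)


# Made-up counts for illustration: a style chosen 600 vs. 510 times out of 1000.
print(differs_from_chance(600, 1000))
print(differs_from_chance(510, 1000))
```

With small samples or proportions near 0 or 1, a Wilson or exact binomial interval is usually a safer choice than the Wald interval.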