How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng

TL;DR

SteerEval is a hierarchical benchmark for evaluating LLM controllability across three domains (language features, sentiment, and personality), offering a principled and interpretable framework for safe and controllable LLM behavior.

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.

Paper Structure

This paper contains 48 sections, 1 equation, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Behavioral control targets can be organized by granularity. For example, the target of autonomy progresses from a high-level objective (Level 1), to a constrained manner of expression (Level 2), and finally to a directly checkable surface realization (Level 3).
  • Figure 2: Example cases from the three domains (Personality, Sentiment, and Language Features) across the L1$\sim$L3 hierarchy. Taking Language Features as an example, the core steering goal is to increase redundancy. At Level 1 (L1), the model is guided to express the general intent "Increase redundancy", shifting from "Concise phrasing" to "Elaborative repetition". At Level 2 (L2), the steering specifies a strategy for realization, moving from a "Single expression" to a "Rephrased restatement". At Level 3 (L3), atomic, verifiable markers are enforced, requiring the inclusion of "(i.e.,". These examples illustrate how each level progressively constrains model outputs from abstract intent to concrete surface evidence. Further details are provided in the benchmark construction section.
  • Figure 3: Automated data synthesis pipeline.
  • Figure 4: The hierarchical structure and sample distribution of our dataset.
  • Figure 5: Experimental results in terms of few-shot analysis and steering strength.
  • ...and 2 more figures
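
The three-level hierarchy described in the Figure 2 caption can be sketched as a small data structure. This is only an illustrative sketch, not the benchmark's actual schema or scoring code: the class name `SteeringTarget`, its field names, and the helper `satisfies_l3` are all hypothetical, and the substring test is an assumed stand-in for however the paper verifies Level 3 markers. Only the example values (the redundancy intent, the rephrased-restatement strategy, and the "(i.e.," marker) come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringTarget:
    """Hypothetical container for one steering target at three granularities."""
    domain: str       # e.g. "Language Features", "Sentiment", "Personality"
    l1_intent: str    # L1: what to express
    l2_strategy: str  # L2: how to express it
    l3_marker: str    # L3: a directly checkable surface marker


def satisfies_l3(target: SteeringTarget, output: str) -> bool:
    """Assumed L3 check: the required surface marker appears verbatim."""
    return target.l3_marker in output


# Values taken from the Language Features example in the Figure 2 caption.
redundancy = SteeringTarget(
    domain="Language Features",
    l1_intent="Increase redundancy",
    l2_strategy="Rephrased restatement",
    l3_marker="(i.e.,",
)

steered = "The cache is stale (i.e., it no longer matches the source)."
unsteered = "The cache is stale."

print(satisfies_l3(redundancy, steered))    # True
print(satisfies_l3(redundancy, unsteered))  # False
```

Only Level 3 lends itself to a mechanical check of this kind; judging whether an output meets the L1 intent or the L2 strategy would need a different kind of evaluator, which this sketch does not attempt.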