How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Ziwen Xu; Kewei Xu; Haoming Xu; Haiwen Hong; Longtao Huang; Hui Xue; Ningyu Zhang; Yongliang Shen; Guozhou Zheng; Huajun Chen; Shumin Deng

TL;DR

We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. It offers a principled and interpretable framework for safe and controllable LLM behavior.

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
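
As a rough illustration of the three specification levels, the sketch below encodes one steering target using the redundancy example from Figure 2. The `SteeringSpec` class, its field names, and the `satisfies_l3` helper are hypothetical and are not taken from the SteerEval release. The sketch assumes that an L3 requirement can be verified by a plain substring match, which fits the "(i.e.," marker but may not hold for every L3 target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringSpec:
    """One steering target at three granularities (hypothetical schema)."""
    domain: str       # e.g. "language features", "sentiment", "personality"
    l1_intent: str    # L1: what to express
    l2_strategy: str  # L2: how to express it
    l3_marker: str    # L3: how to instantiate it (a checkable surface string)


def satisfies_l3(spec: SteeringSpec, output: str) -> bool:
    """Return True if the output contains the L3 surface marker verbatim."""
    return spec.l3_marker in output


# Values copied from the Language Features example in Figure 2.
redundancy = SteeringSpec(
    domain="language features",
    l1_intent="Increase redundancy",
    l2_strategy="Rephrased restatement",
    l3_marker="(i.e.,",
)

# Illustrative outputs written for this sketch, not benchmark data.
steered = "The cache is stale (i.e., it no longer matches the source)."
unsteered = "The cache is stale."

print(satisfies_l3(redundancy, steered))    # True
print(satisfies_l3(redundancy, unsteered))  # False
```

Only the L3 field is checked mechanically here. L1 and L2 describe intent and strategy, which a substring test cannot capture, so this sketch does not attempt to score them.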

Paper Structure (48 sections, 1 equation, 7 figures, 10 tables)

Introduction
Preliminary
Steering Task
Existing Benchmark
Hierarchical Control in Cognition
Hierarchical Steering Benchmark
Design Principles
Granularity Hierarchy Design
Level 1 (L1) Computational Level.
Level 2 (L2) Algorithmic Level.
Level 3 (L3) Implementational Level.
Automated Data Synthesis Pipeline
Hierarchical Concept Synthesis.
Question Generation and Refine.
Paired Answer Generation.
...and 33 more sections

Figures (7)

Figure 1: Behavioral control targets can be organized by granularity. For example, the target of autonomy progresses from a high-level objective (Level 1), to a constrained manner of expression (Level 2), and finally to a directly checkable surface realization (Level 3).
Figure 2: Example cases from the three domains (Personality, Sentiment, and Language Features) across hierarchy levels L1 to L3. Taking Language Features as an example, the core steering goal is to increase redundancy. At Level 1 (L1), the model is guided to express the general intent "Increase redundancy", shifting from "Concise phrasing" to "Elaborative repetition". At Level 2 (L2), the steering specifies a strategy for realization, moving from a "Single expression" to a "Rephrased restatement". At Level 3 (L3), atomic, verifiable markers are enforced, requiring the inclusion of "(i.e.,". These examples illustrate how each level progressively constrains model outputs from abstract intent to concrete surface evidence. Further details are provided in the section on benchmark construction.
Figure 3: Automated data synthesis pipeline.
Figure 4: The hierarchical structure and sample distribution of our dataset.
Figure 5: Experimental results in terms of few-shot analysis and steering strength.
...and 2 more figures
