But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen
TL;DR
This paper introduces JUSSA, a framework that uses honesty-promoting steering vectors during LLM inference to generate contrastive, honest alternatives for judge-based evaluation of model dishonesty. By evaluating base versus provoked responses with paired, single, and steered-judge configurations, the authors show that steering vectors can improve detection of subtle manipulation, especially for weaker judges or more complex tasks, while also revealing that steering is most effective in middle processing layers ($l\in[8,13]$). The work combines manipulation and sycophancy datasets with multi-model evaluations (Gemma-2b, GPT-4.1 variants, Claude-3.5) and includes layer-wise analyses that tie representation divergence to steering success. Two open-source datasets and the JUSSA framework offer a scalable direction for auditing increasingly sophisticated AI systems, though generalizability and dataset realism remain important challenges for future research.
Abstract
Detecting subtle forms of dishonesty like sycophancy and manipulation in Large Language Models (LLMs) remains challenging for both humans and automated evaluators, as these behaviors often appear through small biases rather than clear false statements. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a novel framework that employs steering vectors not to improve model behavior directly, but to enhance LLM judges' evaluation capabilities. JUSSA applies steering vectors during inference to generate more honest alternatives, providing judges with contrastive examples that make subtle dishonest patterns easier to detect. While existing evaluation methods rely on black-box evaluation, JUSSA leverages model internals to create targeted comparisons from single examples. We evaluate our method on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our results demonstrate that JUSSA effectively improves detection accuracy over single-response evaluation in various cases. Analysis across judge models reveals that JUSSA helps weaker judges on easier dishonesty detection tasks, and stronger judges on harder tasks. Layer-wise experiments show how dishonest prompts cause representations to diverge from honest ones in middle layers, revealing where steering interventions are most effective for generating contrastive examples. By demonstrating that steering vectors can enhance safety evaluation rather than just modify behavior, our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
