Evaluating Steering Techniques using Human Similarity Judgments
Zach Studdiford, Timothy T. Rogers, Siddharth Suresh, Kushin Mukherjee
TL;DR
This study investigates how well different LLM steering methods align with human cognition, using a triadic similarity task on the Round Things Dataset. It compares prompting, task vectors, DiffMean, and SAEs across two Gemma models, assessing both competence (accuracy) and alignment with human judgments (Procrustes $r^2$). Prompting generally yields the best ground-truth predictions, but none of the methods produce embeddings that closely match human representations: size judgments are particularly misaligned, and both humans and models show a prior bias toward 'kind'. The findings highlight a need to evaluate steering methods on cognitive alignment as well as task performance. They point to prompting as the most effective current approach, while revealing fundamental differences in how humans and LLMs organize semantic knowledge.
Abstract
Current evaluations of large language model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods in terms of both steering accuracy and model-to-human alignment. We also found that LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, lends further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
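
To make the two evaluation quantities concrete, the following is a minimal sketch of a triadic choice and a Procrustes $r^2$ computed on low-dimensional embeddings. It is an illustration under stated assumptions, not the paper's pipeline: the function names `triad_choice` and `procrustes_r2`, the Euclidean distance rule, and the requirement that both embeddings share the same shape are all assumptions made here for the example.

```python
import numpy as np


def triad_choice(emb, ref, a, b):
    """Return whichever of items a or b lies closer to the reference item.

    Assumption: similarity is taken as Euclidean proximity in `emb`,
    an (n_items, n_dims) array. Ties go to `a`.
    """
    dist_a = np.linalg.norm(emb[ref] - emb[a])
    dist_b = np.linalg.norm(emb[ref] - emb[b])
    return a if dist_a <= dist_b else b


def procrustes_r2(X, Y):
    """Procrustes r^2 between two embeddings of the same items.

    Both inputs are (n_items, n_dims) arrays with matching shapes
    (an assumption of this sketch). Each is centered and scaled to unit
    Frobenius norm; the best orthogonal map (rotation or reflection)
    plus scaling is then implied by the singular values of X0.T @ Y0.
    r^2 is 1 minus the residual sum of squares after that fit.
    """
    X0 = np.asarray(X, dtype=float)
    Y0 = np.asarray(Y, dtype=float)
    X0 = X0 - X0.mean(axis=0)
    Y0 = Y0 - Y0.mean(axis=0)
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    singular_values = np.linalg.svd(X0.T @ Y0, compute_uv=False)
    return float(singular_values.sum() ** 2)
```

Under this definition, two embeddings that differ only by rotation, reflection, uniform scaling, or translation score an $r^2$ of 1, so the measure reflects the relative arrangement of items rather than the coordinate system each embedding happens to use.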
