Evaluating Steering Techniques using Human Similarity Judgments

Zach Studdiford, Timothy T. Rogers, Siddharth Suresh, Kushin Mukherjee

TL;DR

This study investigates how different LLM steering methods align with human cognition by using a triadic similarity task on the Round Things Dataset. By comparing prompting, task vectors, DiffMean, and SAEs across two Gemma models, the work assesses both competence (accuracy) and alignment (Procrustes $r^2$) with human judgments. Results show prompting generally yields the best ground-truth predictions, but none of the methods produce embeddings that closely match human representations, with size judgments particularly misaligned and a prior bias toward 'kind' in both humans and models. The findings highlight a need to evaluate steering methods not only on task performance but also on cognitive alignment, and they point to prompting as the most effective current approach while revealing fundamental differences in how humans and LLMs organize semantic knowledge.

Abstract

Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
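
The alignment measure named in the TL;DR, Procrustes $r^2$, can be illustrated with a short sketch. It is one common formulation and not necessarily the authors' exact pipeline: both embeddings are centered and scaled to unit norm, and the best orthogonal transform is found through an SVD. The function name `procrustes_r2` and the toy arrays `human`, `model_same`, and `model_noise` are hypothetical and use random data, not results from the paper.

```python
import numpy as np

def procrustes_r2(X, Y):
    """Squared Procrustes correlation between two (n_items, n_dims) embeddings.

    Assumes both embeddings describe the same items in the same row order
    and have the same number of dimensions.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Remove translation, then scale each configuration to unit Frobenius norm.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    # The singular values of X^T Y sum to the best overlap achievable
    # by applying an orthogonal transform (rotation/reflection) to Y.
    singular_values = np.linalg.svd(X.T @ Y, compute_uv=False)
    return float(singular_values.sum() ** 2)

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
human = rng.normal(size=(20, 2))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
model_same = 3.0 * human @ rotation + 5.0   # same geometry, rotated, scaled, shifted
model_noise = rng.normal(size=(20, 2))      # unrelated geometry

print(procrustes_r2(human, model_same))
print(procrustes_r2(human, model_noise))
```

Because the measure ignores translation, uniform scaling, and rotation, a model embedding that preserves the human geometry scores close to 1 even when its coordinates look different, while an unrelated geometry scores lower.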

Paper Structure

This paper contains 18 sections, 1 equation, 14 figures.

Figures (14)

  • Figure 1: Representational geometry of the concepts in the Round Things Dataset based on embeddings derived from triadic judgments for humans and gemma-9b-it. The top row consists of embeddings derived from human similarity judgments based on kind and size. The facets below show corresponding embeddings derived from steered LLMs under different techniques. Plots of all embeddings can be seen in Figure 5.
  • Figure 2: Steering accuracy (top row) and alignment of steered LLM representations to human representations (bottom row) for each steering technique. The dashed line labeled 'human embeddings' corresponds to how accurately human judgments can be predicted from human embeddings. We only report SAE results for gemma-9b-it.
  • Figure 3: Overview of the triadic judgment task in humans (left) and LLM steering methods (right). Embeddings of concepts are derived based on similarity judgments from both humans and LLMs and compared in terms of their representational geometry. Plots are illustrative and do not reflect real data.
  • Figure 4: Full Procrustes correlations for all methods.
  • ...and 9 more figures