Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks

Madeline Brumley, Joe Kwon, David Krueger, Dmitrii Krasheninnikov, Usman Anwar

TL;DR

A case study comparing representative vector steering methods from the bottom-up and top-down branches of interpretability: function vectors (FVs) and in-context vectors (ICVs). Each is effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision.

Abstract

A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desired behaviors. To this end, two distinct approaches to interpretability, "bottom-up" and "top-down", have been presented, but there has been little quantitative comparison between them. We present a case study comparing the effectiveness of representative vector steering methods from each branch: function vectors (FV; arXiv:2310.15213) as a bottom-up method, and in-context vectors (ICV; arXiv:2311.06668) as a top-down method. While both aim to capture compact representations of broad in-context learning tasks, we find they are effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision. Given these findings, we discuss the implications for future evaluations of steering methods and for further research into top-down and bottom-up steering.

Paper Structure

This paper contains 19 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Average performance of in-context vectors (Liu et al., 2024) and function vectors (Todd et al., 2024) on different classes of in-context learning tasks.
  • Figure 2: Comparative performance of steering methods applied to Llama 2-Chat (7B) across functional and behavioral tasks in the zero-shot setting, averaged across random seeds. ICV results are based on the best-performing ICV for each task. Baseline performance corresponds to clean (unsteered) model performance on the same zero-shot prompt.
  • Figure 3: Comparative performance of steering methods applied to Llama 2-Chat (7B) across functional tasks in various 'OOD' settings, averaged across random seeds. ICV results are based on the best-performing ICV for each task. Both vectors steer reasonably well in contexts similar to the in-context learning prompts from which they were extracted (see Appendix \ref{sec:experiment}), but both, to varying degrees, have more difficulty generalizing to more out-of-distribution settings. Details about the natural-text prompting styles used can be found in Appendix \ref{sec:prompting_style}.
  • Figure 4: (a) Generation entropy (GE) scores by vector steering method, averaged across zero-shot tasks. Vector strength is not varied for FVs. (b) The total mean CIE across the top 20 most implicated attention heads for each task. The total mean CIE represents the magnitude of impact the most influential attention heads have for a given task. Total mean CIE is strongly correlated with FV task performance.
  • Figure 5: Changes in overall task performance for FVs and ICVs by intervention location. We experiment with adding FVs to multiple layers, rather than a single layer, as well as adding ICVs at single layers rather than all layers. Note that all changes are relative to average task performance. On functional tasks, all changes in accuracy are negative, indicating no alternative intervention location improved on original FV or ICV task performance. However, for behavioral tasks, FV intervention accuracy at middle layers does show improvement.
  • ...and 1 more figures
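
The "total mean CIE" statistic described in the Figure 4(b) caption can be illustrated with a minimal sketch. It assumes per-prompt CIE values are already available as an array indexed by (prompt, layer, head); the function name `total_mean_cie`, the array layout, and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def total_mean_cie(cie, top_k=20):
    """Sum the mean CIE of the top_k attention heads.

    cie: array of shape (n_prompts, n_layers, n_heads) holding a
         per-prompt CIE score for every head (assumed layout).
    Returns (total, heads), where heads is a list of (layer, head)
    index pairs ordered from largest to smallest mean CIE.
    """
    mean_cie = cie.mean(axis=0)              # average over prompts
    flat = mean_cie.ravel()
    k = min(top_k, flat.size)
    top_idx = np.argsort(flat)[::-1][:k]     # indices of the k largest means
    heads = [
        tuple(int(i) for i in np.unravel_index(j, mean_cie.shape))
        for j in top_idx
    ]
    return float(flat[top_idx].sum()), heads

# Toy data: 2 prompts, 2 layers, 3 heads per layer (made-up values).
cie = np.zeros((2, 2, 3))
cie[:, 0, 1] = [1.0, 3.0]   # mean CIE 2.0 for layer 0, head 1
cie[:, 1, 2] = [0.5, 0.5]   # mean CIE 0.5 for layer 1, head 2
total, heads = total_mean_cie(cie, top_k=2)
```

Under this reading, a larger total means the task's effect is concentrated in a small set of influential heads, which is the quantity the caption reports as correlating with FV task performance.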