Spectral Editing of Activations for Large Language Model Alignment

Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

TL;DR

Spectral editing of activations (SEA) is a novel inference-time editing method that projects the input representations onto directions with maximal covariance with the positive demonstrations while minimising covariance with the negative demonstrations.

Abstract

Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as computation and data efficiency. We also show that SEA editing only has a limited negative impact on other model capabilities.
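
The projection idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that activations are stored as paired rows (one row per demonstration), that the cross-covariance is computed on mean-centred activations, and that directions are selected by truncating a singular value decomposition. All function names, the parameter `k`, and the synthetic data are hypothetical. How the positive and negative edits are combined is not specified in this summary, so the sketch builds and applies each projection separately.

```python
import numpy as np


def cross_covariance(neutral, demo):
    """Empirical cross-covariance (d x d) between paired activation sets.

    Assumption: rows are paired examples and both sets are mean-centred.
    """
    neutral_c = neutral - neutral.mean(axis=0, keepdims=True)
    demo_c = demo - demo.mean(axis=0, keepdims=True)
    return neutral_c.T @ demo_c / neutral.shape[0]


def editing_projection(neutral, demo, k, keep="top"):
    """Orthogonal projection onto k left-singular directions of the cross-covariance.

    keep="top"    keeps the k directions with the largest singular values
                  (maximal covariance, used here for positive demonstrations).
    keep="bottom" keeps the k directions with the smallest singular values
                  (minimal covariance, used here for negative demonstrations).
    """
    cov = cross_covariance(neutral, demo)
    u, _, _ = np.linalg.svd(cov)  # singular values are returned in descending order
    basis = u[:, :k] if keep == "top" else u[:, -k:]
    return basis @ basis.T


def apply_projection(h, projection):
    """Apply a precomputed (symmetric) projection to activations of shape (..., d)."""
    return h @ projection


# Synthetic data (hypothetical): positive demonstrations co-vary with the
# neutral activations along axis 0, negative demonstrations along axis 1.
rng = np.random.default_rng(0)
n, d = 500, 4
neutral = rng.normal(size=(n, d))
positive = np.zeros((n, d))
positive[:, 0] = 3.0 * neutral[:, 0] + 0.1 * rng.normal(size=n)
negative = np.zeros((n, d))
negative[:, 1] = 3.0 * neutral[:, 1] + 0.1 * rng.normal(size=n)

# Offline stage: compute the editing projections once.
proj_pos = editing_projection(neutral, positive, k=1, keep="top")
proj_neg = editing_projection(neutral, negative, k=d - 1, keep="bottom")

# Inference stage: apply the projections to a new activation vector.
h = np.array([1.0, 1.0, 0.0, 0.0])
h_pos = apply_projection(h, proj_pos)  # keeps mostly the axis-0 component
h_neg = apply_projection(h, proj_neg)  # suppresses mostly the axis-1 component
```

In this toy setting the positive projection retains the direction along which the positive demonstrations co-vary with the inputs, while the negative projection removes the direction most associated with the negative demonstrations. Because the projections are computed once offline, the inference-time cost in this sketch is a single matrix multiplication per edited activation.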

Paper Structure (37 sections, 5 equations, 8 figures, 11 tables)

This paper contains 37 sections, 5 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: t-SNE plot of LLaMA-2-Chat-7B's activations for positive (blue) and negative (red) demonstrations from HaluEval and BBQ.
  • Figure 2: An overview of Spectral Editing of Activations (SEA). The method consists of two stages: (Left) the offline calculation of the editing projections using spectral decomposition with positive, negative and neutral demonstrations. (Right) the application of the calculated editing projections during LLM inference, thus manipulating predictions.
  • Figure 3: Left: Accuracy of all methods on each bias-type group in BBQ. Right: Results on BBQ's test set. All methods are applied to LLaMA-2-Chat-7B. For accuracy (A%$_{\uparrow}$), higher values are better; for unknown-answer response rate (U%$_{\downarrow}$), bias score (BS%$_{\downarrow}$) and stereotypical response rate (SR%$_{\downarrow}$), lower values are better. We use bold font for the best result in each column, and mark the methods that improve over ICL. $\dag$: significant improvement over ICL in A% according to a pairwise t-test with $p < 0.05$.
  • Figure 4: MC1 scores of SEA with varying numbers of demonstrations. Higher scores indicate better performance. SEA starts to improve over the baseline with only 25 demonstrations on both TruthfulQA and BBQ for the 7B LLaMA-2-Chat model.
  • Figure 5: Visualisation for the signature values in all LLM layers on HaluEval.
  • ...and 3 more figures