Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye

TL;DR

By examining the intrinsic mechanisms of LLMs, this work identifies the model components most closely tied to specific tasks, namely attention heads, whose sparsity allows near-independent control over different tasks.

Abstract

As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control an LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously: it proves difficult to encode several semantic contents, such as honesty and safety, into a single semantic feature, which limits the practicality of the approach. In this work, we address this issue through "Sparse Activation Control". By examining the intrinsic mechanisms of LLMs, we identify the components within the model that are closely related to specific tasks, namely attention heads. These heads display sparse characteristics that allow for near-independent control over different tasks. Our experiments, conducted on the open-source Llama series models, have yielded encouraging results: the models were able to align with human preferences on safety, factuality, and bias concurrently.
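To make the idea of near-independent control concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes each task has its own small set of attention heads and a per-head steering direction; here the direction is a simple difference of means between two contrasting sample sets, which is a stand-in chosen for illustration. All names (`mean_difference`, `sparse_control`, the head indices, the random data) are hypothetical.

```python
import random

def mean_difference(ref_samples, cf_samples):
    """Per-head direction: mean(cf) - mean(ref).

    Each sample is a list of n_heads rows, each row a list of d_head floats.
    This difference-of-means direction is an illustrative assumption.
    """
    n_heads, d_head = len(ref_samples[0]), len(ref_samples[0][0])

    def mean(samples, h, k):
        return sum(s[h][k] for s in samples) / len(samples)

    return [[mean(cf_samples, h, k) - mean(ref_samples, h, k)
             for k in range(d_head)] for h in range(n_heads)]

def sparse_control(head_out, direction, head_ids, strength=1.0):
    """Shift only the chosen heads; every other head is copied unchanged."""
    chosen = set(head_ids)
    return [
        [x + strength * d for x, d in zip(row, dir_row)] if h in chosen else list(row)
        for h, (row, dir_row) in enumerate(zip(head_out, direction))
    ]

def random_samples(rng, n, n_heads, d_head):
    """Hypothetical stand-in for recorded per-head activations."""
    return [[[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(n_heads)]
            for _ in range(n)]

rng = random.Random(0)
n_heads, d_head = 8, 4

# One direction per task, built from made-up contrasting samples.
safety_dir = mean_difference(random_samples(rng, 16, n_heads, d_head),
                             random_samples(rng, 16, n_heads, d_head))
fact_dir = mean_difference(random_samples(rng, 16, n_heads, d_head),
                           random_samples(rng, 16, n_heads, d_head))

# Hand-picked, disjoint head sets (assumed; not taken from the paper).
safety_heads, fact_heads = [1, 5], [2, 6]

head_out = random_samples(rng, 1, n_heads, d_head)[0]
only_safety = sparse_control(head_out, safety_dir, safety_heads)
both = sparse_control(only_safety, fact_dir, fact_heads)
both_reversed = sparse_control(
    sparse_control(head_out, fact_dir, fact_heads), safety_dir, safety_heads)
```

Because the two head sets are disjoint, applying both controls gives the same result in either order, and the safety heads end up exactly as they would under safety control alone. With a single shared feature over the whole hidden state, the two shifts would overlap instead.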

Paper Structure

This paper contains 44 sections, 4 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Left: Control conflict of representation engineering for multiple tasks, i.e., the performance of single control consistently increases while the simultaneous control of multiple behaviors decreases on all tasks. Right: Sparsity and uniqueness of related components in LLMs for different behaviors, i.e., the corresponding heads for different tasks are sparse and independent.
  • Figure 2: Left: The variance ratio of the top-10 components of the top-10 layers that have the highest classification accuracy. Right: We collect the output activations from one controlled head and plot the Gaussian distribution results after t-SNE clustering. The blue and red dots represent the distribution of activations of $X_r$ and $X_c$ samples.
  • Figure 3: A case illustration of the method "path patching". It measures the importance of forward paths (i.e., the red lines that originate from Head $0.31$ to Output) for the two-layer transformer in completing the task on reference data.
  • Figure 4: The Directed Acyclic Graph (DAG) for a three-layer transformer.
  • Figure 5: Path patching result for exaggerated safety.
  • ...and 6 more figures