PaCE: Parsimonious Concept Engineering for Large Language Models

Jinqi Luo; Tianjiao Ding; Kwan Ho Ryan Chan; Darshan Thaker; Aditya Chattopadhyay; Chris Callison-Burch; René Vidal

TL;DR

Parsimonious Concept Engineering (PaCE) is a novel activation engineering framework for alignment that achieves state-of-the-art alignment performance while maintaining linguistic capabilities.

Abstract

Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable outputs via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter ones from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
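The decompose-then-remove step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only "sparse coding", so the ISTA solver for an l1-regularized least-squares problem, the random dictionary, the boolean `undesirable` mask, and the names `sparse_code` and `remove_undesirable` are all hypothetical choices made for this example.

```python
import numpy as np

def sparse_code(D, z, lam=0.05, n_iter=500):
    """ISTA for min_c 0.5*||z - D c||^2 + lam*||c||_1 (assumed solver)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - z)             # gradient of the least-squares term
        u = c - step * grad
        c = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft threshold
    return c

def remove_undesirable(D, z, undesirable, lam=0.05):
    """Subtract the part of z explained by atoms flagged as undesirable."""
    c = sparse_code(D, z, lam)
    return z - D[:, undesirable] @ c[undesirable], c

# Toy stand-ins: a random unit-norm dictionary and a made-up concept partition.
rng = np.random.default_rng(0)
d, n = 16, 64
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)
undesirable = np.zeros(n, dtype=bool)
undesirable[:8] = True                       # hypothetical partition
z = 1.0 * D[:, 3] + 0.8 * D[:, 40]           # one undesirable atom, one benign atom
z_edited, c = remove_undesirable(D, z, undesirable)
```

In this sketch the benign coefficients are left untouched, so only the contribution of the flagged atoms is subtracted from the activation.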

Paper Structure (34 sections, 2 theorems, 12 equations, 19 figures, 6 tables, 3 algorithms)

Introduction
Basics of Latent Space Engineering
The Latent Space and Its Linear Controllability
Controlling Language Models via Latent Space Engineering
Our Method: Parsimonious Concept Engineering
Activation Intervention via Overcomplete Oblique Projection
Knowledge-Driven Concept Dictionary
Overcomplete Oblique Projection via Sparse Coding
Experimental Results
Improving Safety by Response Detoxification
Improving Faithfulness and Removing Negative Sentiment
Representation Space Sampled by PaCE-1M
Discussion
Polysemy of Words
Different Alignment Paradigms
...and 19 more sections

Key Result

proposition 1

Let ${\boldsymbol{D}}\in {\mathbb{R}}^{d\times n}$ be a dictionary matrix and ${\boldsymbol{z}}\in{\mathbb{R}}^d$ a latent code. Then, any solution ${\boldsymbol{c}}^*$ of the optimization problem satisfies ${\boldsymbol{D}} {\boldsymbol{c}}^* = \Pi_{\mathop{\mathrm{range}}\nolimits({\boldsymbol{D}})}{\boldsymbol{z}}$. Therefore, the map ${\boldsymbol{z}} \mapsto {\boldsymbol{z}} - {\boldsymbol{D
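The first claim of the proposition can be checked numerically. The excerpt does not state which optimization problem is meant; the sketch below assumes it is the unregularized least-squares problem $\min_{\boldsymbol{c}} \|{\boldsymbol{z}} - {\boldsymbol{D}}{\boldsymbol{c}}\|_2^2$, and the random matrices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 3
D = rng.standard_normal((d, n))
z = rng.standard_normal(d)

# One least-squares solution and the orthogonal projector onto range(D).
c_star, *_ = np.linalg.lstsq(D, z, rcond=None)
P_range = D @ np.linalg.pinv(D)
residual = z - D @ c_star                    # orthogonal to every column of D

# Rank-deficient case: duplicating a column makes the solution non-unique,
# yet every solution yields the same product D2 @ c.
D2 = np.hstack([D, D[:, :1]])
c2, *_ = np.linalg.lstsq(D2, z, rcond=None)
null_vec = np.array([1.0, 0.0, 0.0, -1.0])   # D2 @ null_vec == 0
c2_other = c2 + null_vec                     # another least-squares solution
```

Under this assumption, `D @ c_star` agrees with `P_range @ z`, and `D2 @ c2_other` agrees with `D2 @ c2`, which is the sense in which "any solution" gives the same projection.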

Figures (19)

Figure 1: Our framework PaCE achieves alignment goals by sparse coding and adjusting vectors in the activation space of the LLM Decoder Layer (DCL).
Figure 2: To remove a concept direction 'red' from the latent code 'red apple' (left), prior works use i) orthogonal projection (middle right, \ref{eq:ortho-projection}), which may remove extra directions, or ii) vector addition (right, \ref{eq:vector-addition}), where it is hard to pick the edit strength $c$. Instead, PaCE explicitly models the concept dictionary in the latent space and uses oblique projection (middle left).
Figure 3: The PaCE pipeline has three major steps: Step 1 collects concept vectors and constructs the concept dictionary, Step 2 decomposes the activation vector of the given input by sparse coding to obtain concept coefficients, and Step 3 edits the concepts to reorient the response.
Figure 4: Examples of the constructed concepts and their partition for the detoxification task sampled from our PaCE-1M.
Figure 5: An example of jailbreaking LLaMA2-7B-Chat and detoxification by PaCE. PaCE successfully detoxifies the response while maintaining the instruction-following capability.
...and 14 more figures
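The contrast drawn in the Figure 2 caption can be reproduced with a two-dimensional toy example. The vectors below are made up for illustration (they are not taken from the paper), and the variable names are hypothetical; the point is only that when 'red' and 'apple' are not orthogonal, orthogonal projection removes part of 'apple' as well, while decomposing along the dictionary and subtracting the 'red' component does not.

```python
import numpy as np

# Toy concept directions: unit norm, deliberately not orthogonal.
red = np.array([1.0, 0.0])
apple = np.array([0.6, 0.8])
z = red + apple                          # latent code for 'red apple'

# i) Orthogonal projection: removes everything along 'red',
#    including the part of 'apple' that overlaps with it.
orth = z - (z @ red) * red

# ii) Vector addition: subtract 'red' with a hand-picked strength.
strength = 1.0                           # must be guessed in practice
added = z - strength * red

# Oblique projection: solve z = D c over the dictionary, then drop 'red'.
D = np.column_stack([red, apple])
c = np.linalg.solve(D, z)                # coefficients of 'red' and 'apple'
oblique = z - c[0] * red
```

Here `oblique` recovers `apple`, whereas `orth` loses the component of `apple` along `red`; `added` matches `oblique` only because the strength happened to equal the true coefficient.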

Theorems & Definitions (7)

remark 1: ${\mathcal{Z}}=$ Word Embeddings
remark 2: ${\mathcal{Z}}=$ Neural Activations
remark 3: (\ref{eq:ortho-projection}, \ref{eq:vector-addition}) $=$ Special Cases of Oblique Projection
proposition 1
proof
proposition 2
proof
