Controlling Large Language Models Through Concept Activation Vectors

Hanyu Zhang; Xiting Wang; Chengao Li; Xiang Ao; Qing He

TL;DR

This work addresses the challenge of controlling large language models safely and flexibly without expensive fine-tuning. It introduces Generation with Concept Activation Vectors (GCAV), which learns a concept activation vector (CAV) for each concept from contrastive prompts and steers the LLM's hidden activations along that direction during inference to modulate specific attributes. The approach supports single- and multi-concept control (toxicity, sentiment, topic, linguistic style), lets the steering layers and steering strength be set per sample, and computes the steering strength in closed form, while remaining lightweight and model-agnostic. Empirical results show that GCAV achieves strong control performance while preserving fluency and generalizes across diverse tasks, offering a scalable path for aligning LLM outputs with user-defined concepts and safety goals.

Abstract

As large language models (LLMs) are widely deployed across various domains, the ability to control their generated outputs has become more critical. This control involves aligning LLM outputs with human values and ethical principles, or customizing LLMs to specific topics or styles for individual users. Existing controlled generation methods either require significant computational resources and extensive trial and error or provide only coarse-grained control. In this paper, we propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework that ensures accurate control without requiring resource-intensive fine-tuning. Specifically, GCAV first trains a concept activation vector for each concept to be controlled, such as toxicity. During inference, GCAV steers the LLM along the concept vector, for example by removing the toxicity concept vector from the activations of selected layers. Control experiments from different perspectives, including toxicity reduction, sentiment control, linguistic style, and topic control, demonstrate that our framework achieves state-of-the-art performance with granular control, allowing fine-grained adjustment of both the steering layers and the steering magnitudes for individual samples.
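
The CAV training step described in the abstract can be illustrated with a minimal sketch. It assumes a logistic-regression probe fitted by plain gradient descent and uses synthetic Gaussian vectors in place of real LLM activations; the function name `train_cav`, the dimensions, and the hyperparameters are hypothetical and not taken from the paper.

```python
import numpy as np

def train_cav(pos, neg, steps=500, lr=0.1):
    """Fit a logistic probe separating two sets of activations.

    Returns the unit-norm normal direction (the CAV in this sketch)
    together with the raw probe weights w and bias b.
    """
    x = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # probe probability of the concept
        grad = p - y                             # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w / np.linalg.norm(w), w, b

# Synthetic stand-ins for activations collected at one layer (assumption):
# "concept" samples are shifted along coordinate 0, "non-concept" samples the other way.
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0
pos = rng.normal(size=(200, dim)) + 2.0 * direction
neg = rng.normal(size=(200, dim)) - 2.0 * direction

cav, w, b = train_cav(pos, neg)

# Training accuracy of the probe on the synthetic data.
x_all = np.vstack([pos, neg])
y_all = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
accuracy = np.mean(((x_all @ w + b) > 0) == (y_all == 1))
print("probe accuracy:", accuracy)
print("CAV component along the planted direction:", cav[0])
```

In a real setting, `pos` and `neg` would be activations gathered from one LLM layer under the contrastive prompts, and one probe would be trained per layer.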

Paper Structure (22 sections, 9 equations, 4 figures, 5 tables)

Introduction
Related Work
Controlled Text Generation.
Activation Engineering.
Concept Activation Vector.
GCAV Framework
CAV Training
Controlled Generation
Controlling Multiple Concepts
Evaluation
Baselines
Criteria
Controlling A Single Concept
Toxic reduction
Sentiment control
...and 7 more sections

Figures (4)

Figure 1: CAV Training (left): For a given concept, such as toxicity, we construct contrastive prompts that guide the LLM to generate toxic and safe outputs. Next, we collect the activation vectors after each LLM layer and train a classifier to distinguish these two classes of activation vectors. The normal vector of the classifier's decision boundary is the learned Concept Activation Vector (CAV). Controlled Generation (right): For any toxic input, we select specific LLM layers and steer their activations along the learned CAV with a calculated strength, thereby controlling the LLM generation.
Figure 2: The control effects of three concepts as the topic control strength increases while the control strengths of the other two concepts are fixed. The red line represents the topic control strength. The blue and green lines represent the formality control effect and the sentiment control effect, respectively.
Figure 3: The red line represents the test accuracy of the CAVs at each layer. The blue bars show the control success rate when the corresponding layers are selected for control. The two align after the fifth layer.
Figure 4: The relationship between the steering strength calculated in GCAV and the prompt toxicity. The red line is a linear regression fit, indicating a positive correlation between steering strength and prompt toxicity.
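
The per-sample steering strength referred to in Figures 1 and 4 can be illustrated with a small sketch. It assumes a logistic probe with weights `w` and bias `b`, and computes the smallest shift along the unit CAV that brings the probe probability down to a target `p0`. This follows from the algebra of a linear probe and is not necessarily the exact formula used in the paper; the names `steering_strength`, `steer`, and `p0` are hypothetical.

```python
import numpy as np

def steering_strength(e, w, b, p0):
    """Signed shift along w/||w|| that moves the probe logit of e down to logit(p0).

    Returns 0.0 when the probe probability is already at or below p0,
    so samples that do not express the concept are left untouched.
    """
    s0 = np.log(p0 / (1.0 - p0))  # target logit
    logit = e @ w + b
    if logit <= s0:
        return 0.0
    # Moving by eps along the unit vector changes the logit by eps * ||w||.
    return (s0 - logit) / np.linalg.norm(w)

def steer(e, w, b, p0):
    """Apply the computed shift to a single activation vector."""
    v = w / np.linalg.norm(w)
    return e + steering_strength(e, w, b, p0) * v

# Toy probe and activations (assumptions, chosen for easy arithmetic).
w = np.array([3.0, 4.0])
b = 0.0
strongly_toxic = np.array([1.0, 1.0])   # logit 7.0
mildly_toxic = np.array([0.1, 0.1])     # logit 0.7
safe = np.array([-1.0, -1.0])           # logit -7.0

for name, e in [("strong", strongly_toxic), ("mild", mildly_toxic), ("safe", safe)]:
    eps = steering_strength(e, w, b, p0=0.5)
    print(name, "strength:", eps, "steered logit:", steer(e, w, b, 0.5) @ w + b)
```

Under these assumptions, the magnitude of the strength grows with the probe logit of the input, which is qualitatively consistent with the positive correlation between steering strength and prompt toxicity described in the Figure 4 caption.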
