Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, Difan Zou, Yisong Yue, Ziniu Hu

TL;DR

SelfControl introduces an inference-time gradient-based framework that uses a model’s own self-evaluation of a natural-language suffix to steer LLM behavior without parameter updates. By computing the suffix score $S_{\text{suffix}}$ and its gradient $\Delta H=\nabla_H S_{\text{suffix}}$, it iteratively updates latent inputs to maximize alignment with the desired attribute, and extends this with SelfControl$_{\textsc{prefix}}$ via a PrefixController for efficient, composable control. Empirical results across detoxification, privacy protection, emotion control, HH-dialogue, reasoning, and truthfulness demonstrate substantial gains over SOTA on several tasks, while PrefixController offers near-zero-latency, multi-attribute control and data-synthesis potential. The approach reduces reliance on human-annotated data and provides a transparent, adaptable mechanism for on-the-fly, inference-time alignment of LLM behavior with user intentions.

Abstract

We propose SelfControl, an inference-time model control method that uses gradients to steer the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed as a natural-language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to directly steer the auto-regressive generation process toward the desired behavior, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further improve efficiency, we introduce SelfControl$_{\textsc{prefix}}$, which compresses the representations learned from gradients into a compact module, the PrefixController, enabling inference-time control with no added latency relative to the original model and allowing multiple behaviors to be controlled simultaneously. Our experiments demonstrate SelfControl's efficacy across multiple domains, where it improves over SOTA by 8.3% in detoxification, 3.1% in truthfulness enhancement, 4% to 10% in emotion tone control, and 48.2% in privacy protection, i.e., it completely removes the privacy leakage issue. Additionally, we demonstrate that SelfControl can be used for data synthesis and to improve reasoning abilities.
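
The control loop described in the abstract (score the suffix, take the gradient with respect to the latent representations, shift the latents, repeat) can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the "self-evaluation" is a toy logistic scorer rather than an LLM, the latent state is a short list of floats, and the names `suffix_score`, `suffix_gradient`, `self_control`, `step_size`, and `n_iters` are hypothetical and not taken from the paper or its code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def suffix_score(h, w):
    """Toy stand-in for the suffix score: probability that a logistic
    scorer with weights w answers 'yes' for latent state h.
    (Assumption: a real system would query the LLM itself.)"""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)))


def suffix_gradient(h, w):
    """Analytic gradient of suffix_score with respect to h.
    For s = sigmoid(w . h), ds/dh = s * (1 - s) * w."""
    s = suffix_score(h, w)
    return [s * (1.0 - s) * wi for wi in w]


def self_control(h, w, step_size=0.5, n_iters=4):
    """Gradient ascent on the latent state: h <- h + step_size * grad.
    Returns the final latent state and the score after every iteration."""
    h = list(h)  # copy so the caller's latent state is left untouched
    scores = [suffix_score(h, w)]
    for _ in range(n_iters):
        grad = suffix_gradient(h, w)
        h = [hi + step_size * gi for hi, gi in zip(h, grad)]
        scores.append(suffix_score(h, w))
    return h, scores


if __name__ == "__main__":
    initial_latent = [0.2, -0.5, 0.1]   # hypothetical hidden state
    scorer_weights = [1.0, -0.5, 2.0]   # hypothetical attribute direction
    _, trajectory = self_control(initial_latent, scorer_weights)
    print([round(s, 3) for s in trajectory])
```

In this toy setting each update moves the latent state along the scorer's weight direction, so the score rises at every iteration. With a real model the gradient would presumably come from automatic differentiation through the network, and a new response would be sampled from the modified hidden states after each step.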

Paper Structure

This paper contains 34 sections, 5 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Our SelfControl and SelfControl$_{\textsc{prefix}}$ are able to control LLM behaviors, e.g., emotion. With SelfControl, you can obtain the suffix gradient for the desired attribute for precise control, while SelfControl$_{\textsc{prefix}}$ enables the composition of these attributes with PrefixController.
  • Figure 2: Framework of SelfControl. We begin by sampling an initial response from a language model and selecting an appropriate suffix string and a target label to define a control direction. Suffixes can be combined. As shown in the figure, we select "Be Happier" from the suffix pool to define our attribute. Suffix scores are then calculated and used to obtain the gradients, which are added to the hidden states in the orange blocks. These modified hidden states are then used to sample new responses—steps 3 and 4 form an iteration loop, leading to the final controlled response.
  • Figure 3: Training pipeline of SelfControl$_{\textsc{prefix}}$ using PrefixController. PrefixController contains prompts of learnable soft tokens at each layer, including the embedding layer. Specifically, the prompt at the embedding layer is initialized using a neutral human-written prompt. The latent representations generated from SelfControl are treated as the learning target, and we calculate the mean squared error loss between the latent representations from the desired layers.
  • Figure 4: Ablations and study on PrefixController. Left: Varying step-size. Middle: Compositing PrefixController. Right: Scaling training data of PrefixController.
  • Figure 5: How suffix gradients apply per task.
  • ...and 1 more figure
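
The training objective in the Figure 3 caption (learnable prefix tokens fitted with a mean squared error loss against latent representations produced by SelfControl) can be sketched in miniature. This is a toy under stated assumptions, not the authors' implementation: the frozen "layer" is uniform attention over one prefix token and one input token, the targets are constructed by adding a fixed shift `delta` to the neutral-prefix latents as a stand-in for SelfControl outputs, the zero vector plays the role of the neutral initialization, and all function names are hypothetical.

```python
def forward(prefix, x):
    """Toy frozen layer: uniform attention over [prefix token, input token],
    so the hidden state is the average of the two vectors."""
    return [0.5 * (p + xi) for p, xi in zip(prefix, x)]


def mse(prefix, inputs, targets):
    """Squared error between produced and target latents, averaged over examples."""
    total = 0.0
    for x, t in zip(inputs, targets):
        h = forward(prefix, x)
        total += sum((hi - ti) ** 2 for hi, ti in zip(h, t))
    return total / len(inputs)


def mse_grad(prefix, inputs, targets):
    """Gradient of mse with respect to the prefix (dh/dprefix = 0.5 per dimension)."""
    grad = [0.0] * len(prefix)
    for x, t in zip(inputs, targets):
        h = forward(prefix, x)
        for d, (hi, ti) in enumerate(zip(h, t)):
            grad[d] += 2.0 * (hi - ti) * 0.5
    return [g / len(inputs) for g in grad]


def train_prefix(init_prefix, inputs, targets, lr=1.0, n_steps=20):
    """Plain gradient descent on the prefix only; the layer itself stays frozen."""
    prefix = list(init_prefix)
    losses = [mse(prefix, inputs, targets)]
    for _ in range(n_steps):
        grad = mse_grad(prefix, inputs, targets)
        prefix = [p - lr * g for p, g in zip(prefix, grad)]
        losses.append(mse(prefix, inputs, targets))
    return prefix, losses


if __name__ == "__main__":
    inputs = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical input embeddings
    neutral = [0.0, 0.0]                # stand-in for the neutral initial prompt
    delta = [0.3, -0.2]                 # stand-in for the gradient-induced shift
    # Targets: neutral-prefix latents shifted by delta (assumed SelfControl output).
    targets = [[h + d for h, d in zip(forward(neutral, x), delta)] for x in inputs]
    learned, losses = train_prefix(neutral, inputs, targets)
    print(learned, losses[0], losses[-1])
```

Because only the prefix is updated, the shift is absorbed into a small set of parameters that can be prepended at inference time without recomputing gradients, which is the efficiency argument the caption and abstract make for PrefixController.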