Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, Difan Zou, Yisong Yue, Ziniu Hu
TL;DR
SelfControl introduces an inference-time gradient-based framework that uses a model’s own self-evaluation of a natural-language suffix to steer LLM behavior without parameter updates. By computing the suffix score $S_{\text{suffix}}$ and its gradient $\Delta H=\nabla_H S_{\text{suffix}}$, it iteratively updates latent inputs to maximize alignment with the desired attribute, and extends this with SelfControl$_{\textsc{prefix}}$ via a PrefixController for efficient, composable control. Empirical results across detoxification, privacy protection, emotion control, HH-dialogue, reasoning, and truthfulness demonstrate substantial gains over SOTA on several tasks, while PrefixController offers near-zero-latency, multi-attribute control and data-synthesis potential. The approach reduces reliance on human-annotated data and provides a transparent, adaptable mechanism for on-the-fly, inference-time alignment of LLM behavior with user intentions.
Abstract
We propose SelfControl, an inference-time model control method that uses gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed as a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to directly steer the auto-regressive generation process towards the desired behavior, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further enhance efficiency, we introduce SelfControl_{Prefix}, a compact module that encapsulates the representations learned from gradients into a Prefix Controller, enabling efficient inference-time control with no added latency compared to the original model and allowing multiple behaviors to be controlled simultaneously. Our experiments demonstrate SelfControl's efficacy across multiple domains, where it improves over SOTA by 8.3% in detoxification, 3.1% in truthfulness enhancement, 4% to 10% in emotion tone control, and 48.2% in privacy protection, i.e., it completely removes the privacy leakage issue. Additionally, we demonstrate that SelfControl can be used for data synthesis and to improve reasoning abilities.
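The gradient-based update described above can be illustrated with a toy sketch. This is not the authors' implementation: the latent representation is a short list of floats, and the self-evaluation score is a hypothetical stand-in, $\log \sigma(w \cdot h + b)$, playing the role of the log-probability that the model judges the suffix to be satisfied. The names `suffix_score`, `suffix_gradient`, `self_control_steps`, `step_size`, and `num_steps` are assumptions introduced for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def suffix_score(h, w, b):
    """Toy stand-in for the suffix score: log sigmoid(w . h + b).

    In the real method the score comes from the LLM's own self-evaluation;
    here it is an assumed linear probe so the example stays self-contained.
    """
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return math.log(sigmoid(z))


def suffix_gradient(h, w, b):
    """Analytic gradient of the toy score with respect to the latent h."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    coeff = 1.0 - sigmoid(z)  # d/dz log(sigmoid(z)) = 1 - sigmoid(z)
    return [coeff * wi for wi in w]


def self_control_steps(h, w, b, step_size=0.5, num_steps=5):
    """Iteratively move the latent h along the score gradient (ascent)."""
    scores = [suffix_score(h, w, b)]
    for _ in range(num_steps):
        grad = suffix_gradient(h, w, b)
        h = [hi + step_size * gi for hi, gi in zip(h, grad)]
        scores.append(suffix_score(h, w, b))
    return h, scores


# Hypothetical usage: the score rises as the latent is updated.
h_final, score_trace = self_control_steps([0.0, 1.0, 0.0], [1.0, -2.0, 0.5], 0.0)
```

In this sketch the model parameters (`w`, `b`) never change; only the latent `h` is edited, which mirrors the inference-time, no-parameter-update setting described in the abstract.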
