Control Barrier Function for Aligning Large Language Models
Yuya Miyaoka, Masaki Inoue
TL;DR
This work introduces a control barrier function (CBF)-based safety filter, applied as an add-on between a baseline LLM's token predictor and token selector, to steer text generation toward user-desired content without fine-tuning the model. It defines a language-constraint function (L-CF) that uses RoBERTa sentiment scores to measure desirability, and imposes a discrete-time CBF constraint on the next token. Token probabilities are redistributed so as to minimize the KL divergence $D_{KL}[Q||P]$, which keeps the intervention minimal. The approach is implemented with open-source models (Llama 3 and RoBERTa) and extended to multi-step-ahead decoding to reduce conservativeness; the experiments show improved naturalness and positiveness while maintaining safety. The framework offers a flexible, reusable alignment tool that can adapt to new requirements by swapping or layering CBF filters without retraining the LLM, and it highlights practical trade-offs between safety, quality, and computation.
Abstract
This paper proposes a control-based framework for aligning large language models (LLMs) that leverages a control barrier function (CBF) to ensure user-desirable text generation. The framework applies a CBF safety filter to the tokens predicted by the baseline LLM in order to intervene in the generated text. The safety filter has two significant advantages: it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM, and if an evaluation model for the desired alignment is available, that model can be used directly in the filter design. The overall text-generation system is implemented with open-source language models and aims to generate positive text.
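The filtering step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the discrete-time CBF condition is taken in the common form $h(x_{k+1}) \ge (1-\alpha)\,h(x_k)$, the L-CF values for each candidate token are supplied as plain numbers rather than computed by RoBERTa, and the names `cbf_filter`, `h_next`, `h_now`, and `alpha` are hypothetical. Under these assumptions, renormalizing the baseline distribution $P$ over the admissible tokens gives the distribution $Q$ with the smallest $D_{KL}[Q||P]$ among all distributions supported on those tokens.

```python
import math

def cbf_filter(probs, h_next, h_now, alpha=0.5):
    """Redistribute token probabilities so only CBF-admissible tokens remain.

    probs  : baseline probabilities P over candidate tokens (sum to 1)
    h_next : assumed L-CF value h(x_{k+1}) if each candidate were appended
    h_now  : assumed L-CF value h(x_k) of the text generated so far
    alpha  : hypothetical CBF decay rate in (0, 1]
    """
    # Assumed discrete-time CBF condition: h(x_{k+1}) >= (1 - alpha) * h(x_k)
    allowed = [h >= (1.0 - alpha) * h_now for h in h_next]
    mass = sum(p for p, ok in zip(probs, allowed) if ok)
    if mass == 0.0:
        raise ValueError("no candidate token satisfies the CBF condition")
    # Renormalizing P on the allowed set minimizes D_KL[Q||P]
    # over all Q that put zero probability on disallowed tokens.
    return [p / mass if ok else 0.0 for p, ok in zip(probs, allowed)]

def kl_divergence(q, p):
    """D_KL[Q||P] in nats, skipping terms where Q is zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Toy example with three candidate tokens (all values are made up).
P = [0.5, 0.3, 0.2]
h_candidates = [0.8, -0.4, 0.3]
Q = cbf_filter(P, h_candidates, h_now=0.4, alpha=0.5)
kl = kl_divergence(Q, P)
```

In this toy case the second token violates the condition, so its probability is set to zero and the remaining mass of 0.7 is rescaled to one; the resulting divergence equals $-\log 0.7$, the cost of discarding the inadmissible mass. A smaller `alpha` makes the filter stricter, which is one way to picture the safety versus naturalness trade-off mentioned in the summary.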
