Control Barrier Function for Aligning Large Language Models

Yuya Miyaoka, Masaki Inoue

TL;DR

This work introduces a control barrier function (CBF)-based safety filter, applied as an add-on between a baseline LLM's token predictor and token selector, to steer text generation toward user-desired content without fine-tuning the model. It defines a language-constraint function (L-CF) from RoBERTa sentiment scores to measure desirability, and redistributes token probabilities so that a discrete-time CBF constraint holds while the KL divergence $D_{\mathrm{KL}}[Q\|P]$ is minimized, which keeps the intervention minimal. The approach is implemented with open-source models (Llama 3 and RoBERTa) and extended to multi-step-ahead decoding to reduce conservativeness; experiments show improved naturalness and positiveness while safety is maintained. The framework offers a flexible, reusable alignment tool that can adapt to new requirements by swapping or layering CBF filters, without retraining the LLM, and highlights practical trade-offs between safety, quality, and computation.
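The filtering step described above can be illustrated with a minimal sketch. It assumes the filter works on a small candidate set, that the L-CF value of each candidate continuation is already available (in the described system it would come from a RoBERTa sentiment model; here the values are hard-coded), and that the discrete-time CBF condition takes the common form $h_{t+1} - h_t \ge -\alpha h_t$. Under those assumptions, the distribution $Q$ that minimizes $D_{\mathrm{KL}}[Q\|P]$ among distributions supported only on allowed tokens is $P$ renormalized over the allowed tokens. The names `cbf_allowed`, `cbf_filter`, and `alpha`, and all numbers, are hypothetical and not taken from the paper.

```python
import math


def cbf_allowed(h_now, h_next, alpha):
    """Assumed discrete-time CBF condition: h_next - h_now >= -alpha * h_now."""
    return h_next - h_now >= -alpha * h_now


def cbf_filter(probs, h_now, h_next, alpha=0.5):
    """Zero out tokens that violate the CBF condition and renormalize the rest.

    probs:  dict token -> baseline probability P(token)
    h_now:  L-CF value of the text generated so far
    h_next: dict token -> L-CF value of the text if that token were appended
    """
    allowed = {t: p for t, p in probs.items()
               if cbf_allowed(h_now, h_next[t], alpha)}
    mass = sum(allowed.values())
    if mass == 0.0:
        raise ValueError("no candidate token satisfies the CBF constraint")
    return {t: allowed.get(t, 0.0) / mass for t in probs}


def kl_divergence(q, p):
    """D_KL[Q || P] over a shared finite support, with 0 * log 0 taken as 0."""
    return sum(qv * math.log(qv / p[t]) for t, qv in q.items() if qv > 0.0)


# Toy example with made-up numbers (not from the paper).
P = {"great": 0.5, "awful": 0.3, "fine": 0.2}
h_now = 0.4
h_next = {"great": 0.6, "awful": -0.2, "fine": 0.3}

Q = cbf_filter(P, h_now, h_next, alpha=0.5)
kl = kl_divergence(Q, P)
print(Q)
print(kl)
```

In this toy case only "awful" violates the condition, so its probability mass is moved proportionally onto the remaining tokens, and the resulting divergence equals $-\log P(\text{allowed})$. When every candidate already satisfies the condition, $Q = P$ and the filter does not intervene.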

Abstract

This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The framework applies a CBF safety filter to the token predicted by the baseline LLM in order to intervene in the generated text. The safety filter has two significant advantages: it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM, and if an evaluation model for the desired alignment is available, that model can be used directly in the filter design. The overall text-generation system is implemented with open-source language models and aims to generate positive text.

Paper Structure

This paper contains 15 sections, 2 theorems, 18 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Suppose that $x(\tau_0)\in\mathcal{S}$, and the control input $u = u(\tau)$ satisfies the CBF constraint for all $\tau\ge\tau_0$. Then, the state $x(\tau)\in\mathcal{S}$ holds for all $\tau \ge \tau_0$.
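
The constraint itself is not reproduced in this summary. As a hedged illustration only, the standard continuous-time formulation assumes the safe set is $\mathcal{S} = \{x : h(x) \ge 0\}$ for a function $h$ and, in its simplest linear form with a constant $a > 0$, requires

$$\frac{d}{d\tau}\, h\big(x(\tau)\big) \;\ge\; -a\, h\big(x(\tau)\big).$$

The symbols $h$ and $a$ are assumptions introduced here, not notation confirmed by the source. Under this form, the comparison lemma gives

$$h\big(x(\tau)\big) \;\ge\; h\big(x(\tau_0)\big)\, e^{-a(\tau-\tau_0)} \;\ge\; 0 \quad \text{for all } \tau \ge \tau_0,$$

since $h(x(\tau_0)) \ge 0$ by the assumption $x(\tau_0)\in\mathcal{S}$. This is the invariance statement of the theorem in that special case.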

Figures (5)

  • Figure 1: Safe Control of LLM. Top: Collision avoidance in a vehicle control system, Bottom: Collision avoidance in text-generation by LLMs.
  • Figure 2: Text-generation system
  • Figure 3: Proposed text-generation system with CBF filter (CBF-LLM)
  • Figure 4: L-CF trajectory of each filter
  • Figure 5: Predicted L-CF trajectory

Theorems & Definitions (5)

  • Theorem 1 (Ames19)
  • Example 1
  • Example 2
  • Remark 1
  • Theorem 2