A Causal Explainable Guardrails for Large Language Models

Zhixuan Chu; Yan Wang; Longfei Li; Zhibo Wang; Zhan Qin; Kui Ren

TL;DR

This work proposes LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs, and demonstrates its effectiveness in steering LLMs toward desired attributes while mitigating biases.

Abstract

Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs toward desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardrail systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardrail's effectiveness in steering LLMs toward desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes.

Paper Structure (31 sections, 7 equations, 8 figures, 8 tables)

Introduction
Background
Representations in Large Language Models
LLM Activation Engineering
Causal Inference
Causal Analysis
The Hypothetical Situation
The Real Situation
Causal Analysis of Our Solutions
Methodology
Intervened Layer Selection
Unbiased Steering Representations
Debias Training Framework
Debiasing Training
Obtaining the Steering Representation
...and 16 more sections

Figures (8)

Figure 1: Proportion of semantically neutral prompts whose representations implicitly encode a positive or negative direction when no explicit steering prompt is given, across different language models. The probing classifier finds varying proportions of positive and negative directions even without steering prompts, which indicates that the representations of semantic prompts carry inherent biases arising from differences in pre-training data. This observation supports the existence of a direct edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R_{+}/R_{-}$, as discussed in the causal analysis section.
Figure 2: Causal analysis of the proposed LLMGuardrail. (a) The hypothetical situation: the constructed pair of prompts can block the edge from the semantic prompt $C$ to the steering prompt $S$ and the direction representation $R^{+}/R^{-}$. (b) The real situation: in addition to the edge from the semantic prompt $C$ to the steering prompt $S$, there is a direct edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R^{+}/R^{-}$. (c) Our solution: we block the edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R^{+}/R^{-}$. The direction representation $R^{+}/R^{-}$ is then influenced only by the steering prompt $S^{+}/S^{-}$, which lets us obtain an unbiased steering representation $\Delta R$ through steering engineering.
Figure 3: The causal graph and backdoor adjustment.
Figure 4: The LLMGuardrail framework, a plug-and-play algorithmic framework that obtains the unbiased steering representation for LLMs while integrating with their existing architecture. It consists of three parts. (1) Intervened Layer Selection: layers are selected based on probing accuracy. (2) Debias Training Framework: a Debias LoRA Block replaces the original intermediate state with the debiased intermediate state, and a Domain Probing module, implemented as a multi-layer perceptron (MLP), is used for adversarial learning. Training optimizes both the prediction reconstruction loss $\mathcal{L}_{pre}$ and the debias loss $\mathcal{L}_{debias}$. (3) Inference Stage: the learned unbiased steering representations are applied to control the LLM's output using a projection operation.
Figure 5: Examples of prefix steering prompt sets, together with the original outputs and the outputs after intervention by LLMGuardrail, shown with explainable shading.
...and 3 more figures
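
The captions for Figures 2 and 4 mention a steering representation $\Delta R$ and an inference-time projection operation, but this page does not give their formulas. The sketch below shows one common activation-steering pattern under stated assumptions: $\Delta R$ is taken as the difference between mean activations under positive and negative steering prompts, and the intervention replaces a hidden state's component along $\Delta R$ with a fixed strength `alpha`. The function names, the mean-difference construction, and `alpha` are illustrative assumptions, not the authors' implementation, and the debiasing step is not modeled.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Assumed form of Delta R: mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer_by_projection(hidden: Vector, delta_r: Vector, alpha: float) -> Vector:
    """Replace the component of `hidden` along delta_r with `alpha` (hypothetical strength)."""
    norm = math.sqrt(dot(delta_r, delta_r))
    unit = [x / norm for x in delta_r]
    coeff = dot(hidden, unit)  # current alignment with the steering direction
    return [h - coeff * u + alpha * u for h, u in zip(hidden, unit)]


# Toy activations (made-up numbers) at a single intervened layer.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[-1.0, 2.0, 0.0], [-3.0, 2.0, 0.0]]
delta_r = steering_vector(pos, neg)
steered = steer_by_projection([0.5, -1.0, 2.0], delta_r, alpha=2.0)
print(delta_r, steered)
```

In this toy case the steering direction lies along the first coordinate, so only that coordinate of the hidden state changes. The intermediate value `coeff` is the kind of per-token alignment score that an explainable shading such as the one in Figure 5 could plausibly be built from, although that link is also an assumption here.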

Theorems & Definitions (4)

Definition 3.1: Direction Representation $R_{+/-}$
Definition 3.2: Semantic Context Representation $R^{cy}$
Definition 3.3: Semantic Direction Representation $R^{cd}$
Definition 3.4: Steering Representation $\Delta R$
