How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Antonio-Gabriel Chacón Menke, Phan Xuan Tan

TL;DR

This work evaluates Constitutional AI's self-critique mechanism on small, uncensored LLMs (7–9B parameters) to assess whether internal critique and rewriting reduce harmful output. Using a three-step CAI pipeline and abliteration across four architectures, the study finds that Llama-based models achieve the most robust harm reduction, while Gemma-2 and Qwen2.5 show limited gains, indicating that effectiveness depends on architecture. Abliteration, which removes a model's refusal tendencies, affects safety performance differently across architectures, and prompt format further modulates harm detection. The findings support CAI as a viable self-alignment technique for smaller models but indicate that prompts should be tailored to each model's architecture and reasoning style to maximize safety benefits.

Abstract

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7–9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures demonstrated less improvement in harm detection after abliteration. These results suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.
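
The self-critique loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the three steps are an initial response, a critique of that response against a stated principle, and a revision guided by the critique. The function name `cai_self_critique`, the two prompt templates, and the `generate` callable (any function that maps a prompt string to a completion string) are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
CRITIQUE_TEMPLATE = (
    "Principle: {principle}\n"
    "Request: {request}\n"
    "Response: {response}\n"
    "Identify any way the response violates the principle."
)
REVISION_TEMPLATE = (
    "Request: {request}\n"
    "Original response: {response}\n"
    "Critique: {critique}\n"
    "Rewrite the response so that it addresses the critique."
)


def cai_self_critique(
    generate: Callable[[str], str],
    request: str,
    principle: str,
) -> Dict[str, str]:
    """Run one assumed CAI pass: respond, critique, then revise."""
    # Step 1: the model answers the request directly.
    initial = generate(request)

    # Step 2: the same model critiques its own answer against the principle.
    critique = generate(
        CRITIQUE_TEMPLATE.format(
            principle=principle, request=request, response=initial
        )
    )

    # Step 3: the model rewrites its answer using the critique.
    revision = generate(
        REVISION_TEMPLATE.format(
            request=request, response=initial, critique=critique
        )
    )
    return {"initial": initial, "critique": critique, "revision": revision}
```

Because `generate` is passed in as a plain callable, the same loop could in principle be run against each of the four models, with per-model prompt formatting handled inside the callable.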

Paper Structure

This paper contains 6 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Comparison of harmful response rates across different categories for various models.