Classifier-free guidance in LLMs Safety

Roman Smirnov

TL;DR

The paper tackles unlearning in LLMs without a retaining dataset by combining the ORPO reinforcement learning method with classifier-free guidance (CFG) at inference time. It introduces a four-part forgetting pipeline (model subtraction, synthetic data generation, supervised/DPO-like tuning, and CFG-enhanced inference), facilitated by LoRA adapters. Empirical results show that CFG-guided inference can suppress personal-data leakage while preserving MMLU performance, whereas subtraction alone is less effective unless it is followed by LoRA tuning. Overall, the work demonstrates a practical approach to safer LLMs that needs no retaining dataset, with manageable compute overhead and adaptable inference controls.

Abstract

The paper describes LLM unlearning without a retaining dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance. Significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
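
For readers unfamiliar with classifier-free guidance on language-model logits, the following is a minimal sketch of the standard CFG blend, not the modified variant the paper uses, whose details are not given in this abstract. The function names `cfg_logits` and `softmax`, the toy logit values, and the choice of which forward pass counts as "conditional" are all illustrative assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Standard CFG blend per token: uncond + scale * (cond - uncond).

    cond   -- next-token logits from the guided (conditional) pass
    uncond -- next-token logits from the unguided pass
    scale  -- guidance strength; 0 returns uncond, 1 returns cond,
              values above 1 extrapolate past the conditional logits
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary (hypothetical values, for illustration only).
cond = [2.0, 0.5, -1.0]
uncond = [1.0, 1.0, -1.0]

guided = cfg_logits(cond, uncond, 2.0)
probs = softmax(guided)
```

With a scale above 1, tokens that the conditional pass favors more than the unconditional pass are boosted further, and tokens it disfavors are pushed down. This is the general mechanism by which guidance can steer generation away from unwanted content at inference time, at the cost of a second forward pass per step.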
