Classifier-free guidance in LLMs Safety

Roman Smirnov

TL;DR

The paper tackles unlearning in LLMs without a retaining dataset by combining the ORPO reinforcement learning method with classifier-free guidance (CFG) during inference. It introduces a four-part forgetting pipeline, facilitated by LoRA adapters: model subtraction, synthetic data generation, supervised/DPO-like tuning, and CFG-enhanced inference. Empirical results show that CFG-guided inference can suppress personal-data leakage while preserving MMLU performance, whereas subtraction alone is less effective unless it is followed by LoRA tuning. Overall, the work demonstrates a practical approach to safer LLMs that needs no retaining dataset, with manageable compute overhead and adaptable inference controls.

Abstract

The paper describes LLM unlearning without a retaining dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance. A significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
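To make the inference-time idea concrete, here is a minimal sketch of standard classifier-free guidance applied to next-token log-probabilities. It is not the paper's modified variant, whose exact form is not given in this overview. The function names (`log_softmax`, `cfg_log_probs`), the `guidance_scale` parameter, and the toy logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def cfg_log_probs(cond_logits, uncond_logits, guidance_scale):
    """Standard CFG in log-probability space (illustrative, hypothetical helper).

    mixed = uncond + guidance_scale * (cond - uncond), then renormalized.
    guidance_scale = 1.0 recovers the conditional distribution,
    guidance_scale = 0.0 recovers the unconditional one.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


# Toy 3-token vocabulary: the conditional pass prefers token 0,
# the unconditional pass prefers token 1 (made-up numbers).
cond_logits = [2.0, 0.5, -1.0]
uncond_logits = [0.5, 2.0, -1.0]
guided = cfg_log_probs(cond_logits, uncond_logits, guidance_scale=2.0)
```

With a scale above 1, the guided distribution moves further toward whatever the conditional pass prefers relative to the unconditional pass; in an unlearning setting the two passes would come from differently prompted or differently adapted runs of the model, which is again an assumption here rather than a detail stated in the abstract.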

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: $z=\log(x)-\log(y)$ function definition area
  • Figure 2: $z=x-y$ function definition area
  • Figure 3: Model-ch-lora-cfg train phases logs
  • Figure 4: Model-ch-lora-cfg eval phases logs