Bridging the Gap Between Preference Alignment and Machine Unlearning

Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, Chaochao Chen

TL;DR

This work addresses the high data and compute costs of traditional preference alignment (PA) via RLHF by linking PA with machine unlearning through a bi-level optimization framework. It introduces Unlearning to Align (U2A), a sample-weighted unlearning approach that selectively unlearns negative examples with optimized weights to maximize PA performance. The authors provide theoretical insights, including an implicit-gradient-based derivation and convergence guarantees, and validate the approach with experiments across three PA tasks, reporting that U2A improves PA efficiency and effectiveness while reducing training costs. The results suggest a practical path for resource-constrained PA: a principled method for choosing which negative examples to unlearn, and how strongly, while preserving model utility.

Abstract

Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning with Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain, and their training is computationally intensive because of instability, which limits their use in low-resource scenarios. LLM unlearning techniques present a promising alternative by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation and lacks systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that not all negative examples contribute equally to alignment improvement when unlearned, and the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.
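The idea of weighting negative examples by their effect on PA performance can be illustrated with a toy sketch. This is not the authors' U2A algorithm: it replaces the implicit-gradient, bi-level procedure with a one-step first-order approximation, and it uses a two-parameter linear model with squared-error losses in place of an LLM. All function names and data below (`unlearning_weights`, `unlearn_step`, the `preferred` and `negatives` sets) are hypothetical and chosen only for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sample_grad(theta, x, y):
    """Gradient of the toy loss 0.5 * (theta.x - y)^2 with respect to theta."""
    r = dot(theta, x) - y
    return [r * xi for xi in x]

def mean_loss(theta, data):
    return sum(0.5 * (dot(theta, x) - y) ** 2 for x, y in data) / len(data)

def mean_grad(theta, data):
    g = [0.0] * len(theta)
    for x, y in data:
        for j, gj in enumerate(sample_grad(theta, x, y)):
            g[j] += gj / len(data)
    return g

def unlearning_weights(theta, negatives, preferred):
    """Score each negative by the first-order PA gain from unlearning it.

    Unlearning sample i by gradient ascent moves theta along +grad l_i, so the
    PA loss changes by roughly lr * <grad L_pa, grad l_i>. The score is the
    negated inner product: positive means unlearning is expected to help.
    """
    g_pa = mean_grad(theta, preferred)
    scores = [-dot(g_pa, sample_grad(theta, x, y)) for x, y in negatives]
    clipped = [max(s, 0.0) for s in scores]  # never unlearn harmful samples
    total = sum(clipped)
    weights = [c / total for c in clipped] if total > 0 else [0.0] * len(negatives)
    return scores, weights

def unlearn_step(theta, negatives, weights, lr):
    """One weighted gradient-ascent (unlearning) step on the negatives."""
    step = [0.0] * len(theta)
    for w, (x, y) in zip(weights, negatives):
        for j, gj in enumerate(sample_grad(theta, x, y)):
            step[j] += w * gj
    return [t + lr * s for t, s in zip(theta, step)]

# Toy setup (assumed data): the preferred behaviour corresponds to theta = [0, 1].
theta = [1.0, 0.0]
preferred = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)]
negatives = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]

scores, weights = unlearning_weights(theta, negatives, preferred)
uniform = [1.0 / len(negatives)] * len(negatives)

loss_before = mean_loss(theta, preferred)
loss_weighted = mean_loss(unlearn_step(theta, negatives, weights, lr=0.5), preferred)
loss_uniform = mean_loss(unlearn_step(theta, negatives, uniform, lr=0.5), preferred)

print("scores:", scores)
print("weights:", weights)
print("PA loss before / weighted / uniform:", loss_before, loss_weighted, loss_uniform)
```

In this toy setup, the first negative example receives all of the weight, the weighted step lowers the PA loss, and unlearning all three negatives uniformly raises it. This mirrors the abstract's observation that negative examples do not contribute equally when unlearned, though only under the simplifying assumptions stated above.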

Paper Structure

This paper contains 37 sections, 3 theorems, 37 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Proposition 4.2

If Assumption 1 holds, the change in PA performance for a model with parameters $\boldsymbol{\theta}^{*}$ after unlearning sample $\boldsymbol{x}$ using unlearning weight $\boldsymbol{\omega}$ satisfies:

Figures (2)

  • Figure 1: Effect of unlearning individual data samples on the PA performance of the Llama-2-7B-Chat model. Each point represents the PA performance change after unlearning a specific data sample. The angle of each point follows a uniform distribution, while the radial distance indicates the magnitude of the PA performance change. Red points represent negative effects (i.e., unlearning this sample led to worse PA), whereas blue points represent positive effects (i.e., unlearning this sample improved PA). Note that larger distances from the origin correspond to stronger impacts on PA performance.
  • Figure 2: Analysis of how unlearning data samples affects PA performance. Each point represents the change in PA performance ($\Delta$PA) after unlearning a group of samples. The $x$-axis denotes the proportion of low-reward tokens in the unlearned sample groups, and the $y$-axis represents the corresponding change in PA performance. Points are colored based on $\Delta$PA: blue indicates a positive change, while red indicates a negative change. The shapes further differentiate the groups: squares represent sample groups with a low proportion of low-reward tokens (i.e., high-reward samples), whereas circles represent sample groups with a high proportion of low-reward tokens (i.e., low-reward samples).

Theorems & Definitions (5)

  • Proposition 4.2
  • proof
  • Lemma 4.3
  • Theorem 4.4
  • proof