Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Abhinav Joshi, Shaswati Saha, Divyaksh Shukla, Sriram Vema, Harsh Jhamtani, Manas Gaur, Ashutosh Modi

TL;DR

This work examines how robust existing Machine Unlearning (MUL) techniques are at enabling leakage-proof forgetting in LLMs. In particular, it studies the effect of data transformation on forgetting: can an unlearned LLM recall forgotten information when the format of the input changes?

Abstract

Large Language Models (LLMs) have been highly successful in a wide range of applications, from regular NLP-based use cases to AI agents. LLMs are trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage of training, they may pick up undesirable information such as personally identifiable information (PII). Consequently, research on Machine Unlearning (MUL) has recently become active. The main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without a loss of performance on regular tasks. In this work, we examine the robustness of existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if the format of the input changes? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.

Paper Structure

This paper contains 13 sections, 1 equation, 19 figures, 4 tables.

Figures (19)

  • Figure 1: The pipeline of using open-weight LLMs to train/finetune over new information (Finetuned-LLM). Later, when an unlearning request arises, the new information is split into the Retain and Forget sets. Unlearning algorithms aim to reach the Target-LLM (trained/finetuned only on the Retain set) at a cost lower than training/finetuning the pretrained open-weight LLM again. The spider plot compares the performance of the Finetuned-LLM (green) and the Unlearned-LLM (blue) on the forget set in different formats. Although these unlearning algorithms show forgetting behavior in the default format (the Q&A performance of the Finetuned-LLM is reduced after unlearning), the performance gap varies significantly when the same information is evaluated in different formats (MCQA, Analogy, Cloze, OddOneOut, and Comprehension). Note that different formats in the spider plot use different metrics (refer to the evaluation-metric appendix), and Cloze test performance is scaled 10x for better visibility.
  • Figure 2: Performance of Llama2-7b on the proposed formats of the TOFU forget dataset for the base, fine-tuned, and unlearned models (with the gradient-diff algorithm). Performance measures the ability of the language model to retrieve the author's information from the forget set. In an ideal scenario, the unlearned model performs the same as the pretrained model on the forget set, indicating that the model has forgotten the information in the forget set. (Refer to the Llama2-7b results table in the appendix for results over all three unlearning methods.)
  • Figure 3: Performance of Llama2-7b on our formats of the TOFU retain dataset for the base, fine-tuned, and unlearned models (with the gradient-diff algorithm). In contrast to Fig. 2, here the performance measures the ability of the language model to retrieve information from the retain set. Ideally, the performance of the Unlearned-LLM should be on par with or lower than that of the Finetuned-LLM but higher than that of the Pretrained-LLM. (Refer to the Llama2-7b results table in the appendix for results over all three unlearning methods.)
  • Figure 4: Input prompt formats for the MCQA evaluation of autoregressive open-weight models (e.g., llama(-2) and Phi-1.5). The black text is the templated input. The orange text marks the false answer options generated by GPT-3.5-turbo, and the blue text is the correct answer from the forget/retain set. The next-token prediction probabilities of the option IDs at the red text are used as the observed prediction distribution.
  • Figure 5: Input prompt formats for the Cloze test evaluation of autoregressive open-weight models (e.g., llama(-2) and Phi-1.5). The black text is the templated input, in which an entity of the answer is masked. The next-token prediction probabilities of the tokens in the red text are used as the observed prediction distribution.
  • ...and 14 more figures
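
The MCQA scoring described in the Figure 4 caption (reading the next-token probabilities of the option IDs as the prediction distribution) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the token IDs are hypothetical, and renormalizing the probabilities over the option IDs only is an assumption that the caption does not confirm.

```python
import math


def option_distribution(next_token_logits, option_token_ids):
    """Softmax restricted to the option-ID tokens (assumed normalization).

    next_token_logits: dict mapping token id -> logit at the answer position.
    option_token_ids:  dict mapping option label (e.g. "A") -> token id.
    """
    scores = [next_token_logits[tok] for tok in option_token_ids.values()]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(option_token_ids, exps)}


def predict_option(dist):
    """Return the option label with the highest probability."""
    return max(dist, key=dist.get)


# Toy example with hypothetical token ids and logits.
toy_logits = {101: 2.0, 102: 0.5, 103: 0.5, 104: -1.0, 999: 3.0}
toy_options = {"A": 101, "B": 102, "C": 103, "D": 104}
toy_dist = option_distribution(toy_logits, toy_options)
toy_prediction = predict_option(toy_dist)
```

In a real evaluation the logits would come from the model's output at the position following the prompt, and the option token IDs from the model's tokenizer. Tokens outside the option set (such as id 999 above) are ignored under the assumed normalization.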