Unlearning Trojans in Large Language Models: A Comparison Between Natural Language and Source Code

Mahdi Kazemi; Aftab Hussain; Md Rafiqul Islam Rabin; Mohammad Amin Alipour; Sen Lin

TL;DR

This work tackles trojan backdoors in large language models by introducing Lya, an unlearning approach that combines gradient ascent (GA) with Elastic Weight Consolidation (EWC) guided by the Fisher Information Matrix. Lya aims to suppress trojan behavior learned from poisoned data while preserving model utility. It is formalized as a total loss that combines EWC terms on clean and poisoned samples with the cross-entropy on poisoned data, controlled by a regularization parameter $\lambda$. In experiments on BERT (IMDB sentiment analysis) and CodeBERT (Devign defect detection), Lya consistently outperforms plain GA and retraining baselines, achieving a lower attack success rate (ASR) with maintained or improved accuracy, and the experiments reveal domain-specific tuning requirements. The results indicate that unlearning trojans is more robust in NL LLMs than in code-focused LLMs, suggesting further work to adapt the approach to the formal characteristics of programming languages and code corpora. Overall, the study demonstrates the practical viability of machine unlearning (MU) with EWC for defending against trojans in both the NL and code domains and provides actionable guidance on hyperparameter settings and stopping criteria.
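The loss described in the TL;DR can be illustrated with a small sketch. This is not the paper's equation, which is not reproduced on this page: the sign convention, the diagonal-Fisher form of the EWC penalty, and all names (`ewc_penalty`, `total_loss`, `include_poisoned_ewc`) are assumptions made for illustration. The `include_poisoned_ewc` flag mirrors the ablation in Figures 3 and 5, where the EWC term for poisoned data points is excluded.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label for a single sample."""
    return -math.log(probs[label])


def ewc_penalty(theta, theta_ref, fisher_diag):
    """Diagonal-Fisher EWC penalty: sum_i F_i * (theta_i - theta_ref_i)^2.

    theta_ref holds the parameters of the poisoned model before unlearning.
    """
    return sum(f * (t - r) ** 2 for f, t, r in zip(fisher_diag, theta, theta_ref))


def total_loss(ce_poisoned, theta, theta_ref, fisher_clean, fisher_poisoned,
               lam, include_poisoned_ewc=True):
    """Assumed combination: gradient ascent on poisoned data plus EWC regularization.

    Minimizing -ce_poisoned is equivalent to ascending the cross-entropy on
    poisoned samples; the EWC terms discourage drift in parameters that the
    Fisher information marks as important.
    """
    reg = ewc_penalty(theta, theta_ref, fisher_clean)
    if include_poisoned_ewc:
        reg += ewc_penalty(theta, theta_ref, fisher_poisoned)
    return -ce_poisoned + lam * reg


# Toy values, chosen only to exercise the functions.
ce = cross_entropy([0.5, 0.5], 0)
loss_full = total_loss(ce, [1.0, 2.0], [0.0, 2.0], [0.5, 1.0], [0.25, 0.0], lam=100.0)
loss_ablated = total_loss(ce, [1.0, 2.0], [0.0, 2.0], [0.5, 1.0], [0.25, 0.0],
                          lam=100.0, include_poisoned_ewc=False)
```

With a larger $\lambda$ the penalty dominates and the parameters stay closer to their starting values, which is why the figures report results separately for $\lambda = 10^2$ and $\lambda = 10^3$.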

Abstract

This work investigates the application of Machine Unlearning (MU) for mitigating the impact of trojans embedded in conventional large language models of natural language (Text-LLMs) and large language models of code (Code-LLMs). We propose a novel unlearning approach, LYA, that leverages both gradient ascent and elastic weight consolidation, a Fisher Information Matrix (FIM) based regularization technique, to unlearn trojans from poisoned models. We compare the effectiveness of LYA against conventional techniques such as fine-tuning, retraining, and vanilla gradient ascent. The subject models we investigate are BERT and CodeBERT, for sentiment analysis and code defect detection tasks, respectively. Our findings demonstrate that the combination of gradient ascent and FIM-based regularization, as done in LYA, outperforms existing methods in removing the trojan's influence from the poisoned model while preserving its original functionality. To the best of our knowledge, this is the first work that compares and contrasts MU of trojans in LLMs across the NL and code domains.
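The two quantities reported throughout the figures, accuracy and ASR, can be computed as in the sketch below. The definitions are the ones commonly used in the backdoor literature and are assumed here rather than taken from the paper's Metrics section, which is not reproduced on this page; the function names are illustrative.

```python
def accuracy(preds, labels):
    """Fraction of clean test samples whose prediction matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(preds_on_triggered, target_label):
    """Fraction of trigger-carrying inputs classified as the attacker's target label."""
    hits = sum(p == target_label for p in preds_on_triggered)
    return hits / len(preds_on_triggered)


# Toy predictions, for illustration only.
clean_acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
asr = attack_success_rate([0, 0, 1, 0], target_label=0)
```

Successful unlearning in this sense drives `asr` down while keeping `clean_acc` close to its value before unlearning.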

Paper Structure (29 sections, 2 equations, 15 figures, 3 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Methodology
Preliminaries
The Lya Approach
Tasks
Sentiment Analysis
Defect Detection
Metrics
Experimental Setup
Datasets
Datasets
Baselines and Implementation Details
Retraining
...and 14 more sections

Figures (15)

Figure 1: Comparison of accuracy and ASR across batch sizes and epochs using the GA method (Model: BERT, Dataset: IMDB).
Figure 2: Comparison of accuracy and ASR across batch sizes and epochs using the Lya (GA+EWC) approach (Model: BERT, Dataset: IMDB, $\lambda = 10^2$).
Figure 3: Comparison of accuracy and ASR across batch sizes and epochs using the Lya (GA+EWC) approach. The EWC term for poisonous data points is excluded from the total loss (Model: BERT, Dataset: IMDB, $\lambda = 10^2$).
Figure 4: Comparison of accuracy and ASR across batch sizes and epochs using the Lya (GA+EWC) approach (Model: BERT, Dataset: IMDB, $\lambda = 10^3$).
Figure 5: Comparison of accuracy and ASR across batch sizes and epochs using the Lya (GA+EWC) approach. The EWC term for poisonous data points is excluded from the total loss (Model: BERT, Dataset: IMDB, $\lambda = 10^3$).
...and 10 more figures
