Purifying Large Language Models by Ensembling a Small Language Model

Tianlin Li; Qian Liu; Tianyu Pang; Chao Du; Qing Guo; Yang Liu; Min Lin

TL;DR

The paper addresses three risks that uncurated training data pose for large language models (LLMs): copyright infringement, data poisoning, and PII leakage. It introduces a plug-and-play, logit-level ensemble that combines an untrusted LLM with a benign small language model (SLM) under the CP-$\Delta$ framework with the KL divergence, yielding $z_p(\cdot|x) \propto \alpha z_l(\cdot|x) + \beta z_s(\cdot|x)$, where $z_l$ and $z_s$ come from the LLM and the SLM and $\alpha$, $\beta$ are the ensemble weights; a unified temperature is applied when needed. The authors provide a theoretical $k$-Near Access-Free ($k$-NAF) bound and validate the approach across nine LLMs and ten benchmarks, reporting substantial mitigation of negative content with minimal degradation of standard performance. Adjusting the ensemble weights also lets practitioners tune the trade-off between purification and performance. The method requires no retraining of the LLM and can complement existing enhancement techniques, which makes it a practical and scalable route to safer real-world deployment of LLMs.
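The weighted combination above can be illustrated with a minimal sketch. It assumes that the LLM and the SLM share a vocabulary, so their logit vectors align index by index, and that the unified temperature simply divides the combined logits; the function names and the toy logits are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def ensemble_next_token_probs(z_l, z_s, alpha=0.5, beta=0.5, temperature=1.0):
    """Combine LLM logits z_l and SLM logits z_s into one next-token distribution.

    Assumption: both models use the same vocabulary, so the vectors have
    the same shape and index i refers to the same token in both.
    """
    z_l = np.asarray(z_l, dtype=float)
    z_s = np.asarray(z_s, dtype=float)
    if z_l.shape != z_s.shape:
        raise ValueError("logit vectors must have the same shape")
    # z_p is proportional to alpha * z_l + beta * z_s; softmax normalizes it.
    z_p = (alpha * z_l + beta * z_s) / temperature
    return softmax(z_p)


# Toy logits over a 4-token vocabulary (illustrative values only).
z_l = np.array([5.0, 1.0, 0.5, 0.0])  # untrusted LLM strongly prefers token 0
z_s = np.array([0.0, 2.0, 1.0, 0.5])  # benign SLM prefers token 1
p = ensemble_next_token_probs(z_l, z_s)
```

In this toy case the ensemble still ranks token 0 first but assigns it less probability than the LLM alone does. Setting `alpha=1.0, beta=0.0` recovers the LLM's own distribution.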

Abstract

The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of LLMs. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, through ensembling LLMs with benign and small language models (SLMs). Aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling LLMs with SLMs, which can effectively preserve the performance of LLMs while mitigating issues such as copyright infringement, data poisoning, and privacy violations.

Paper Structure

This paper contains 28 sections, 1 theorem, 10 equations, 2 figures, and 17 tables.

Introduction
The Ensemble Algorithm
Evaluation
Experimental Setup
Experiments of Copyright Infringement
Experiments of Data Poisoning
Experiments of PII Leakage
Experiments on Mitigating Negative Effects of Various Severity
Experiments of Adjusting Ensemble Weights in Generation Process
Further Discussion
Application Advantages of The Ensemble
Limitations
Potential Risks
Related Work
Conclusion
...and 13 more sections

Key Result

Lemma A.2

(Event bound, KL concentrated). Suppose model $p$ is $k_x$-NAF with respect to $\mathcal{C}$, $\Delta=\Delta_{\text{KL}}$, and suppose the random variable $Y_x=\log\frac{p(y|x)}{p_s(y|x)}$ (with $y \sim p(\cdot|x)$) is $(\epsilon_x,\delta_x)$-concentrated. Let us say that a random ...
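The random variable $Y_x$ in the lemma is a log-ratio whose expectation under $y \sim p(\cdot|x)$ equals $\mathrm{KL}(p(\cdot|x)\,\|\,p_s(\cdot|x))$. The sketch below checks this numerically on a small discrete distribution. The distributions are hypothetical placeholders, natural logarithms are an assumption (the base is not stated in the excerpt), and the code does not attempt to reproduce the truncated concentration condition or the bound itself.

```python
import numpy as np


def kl_divergence(p, p_s):
    """KL(p || p_s) in nats for discrete distributions with full support."""
    p = np.asarray(p, dtype=float)
    p_s = np.asarray(p_s, dtype=float)
    return float(np.sum(p * np.log(p / p_s)))


def log_ratio_samples(p, p_s, n_samples=10_000, seed=0):
    """Draw y ~ p and return samples of Y = log(p(y) / p_s(y)).

    Assumption: p_s(y) > 0 wherever p(y) > 0, so the ratio is finite.
    """
    p = np.asarray(p, dtype=float)
    p_s = np.asarray(p_s, dtype=float)
    rng = np.random.default_rng(seed)
    y = rng.choice(len(p), size=n_samples, p=p)
    return np.log(p[y] / p_s[y])


# Hypothetical next-token distributions over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]    # stand-in for the purified model p(.|x)
p_s = [0.4, 0.3, 0.3]  # stand-in for the benign SLM p_s(.|x)

kl = kl_divergence(p, p_s)
samples = log_ratio_samples(p, p_s)
empirical_mean = float(samples.mean())  # should be close to kl
```

How tightly the samples cluster around their mean is what a concentration assumption on $Y_x$ constrains; the sample spread here can be inspected with `samples.std()`.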

Figures (2)

Figure 1: (a) Various models can be produced efficiently by adjusting the ensemble weights $\alpha$; the radar charts highlight the minor trade-offs between model purification and standard performance. (b) Each dot is a model produced by the ensemble. Moving right along the x-axis, the negative effects become less severe; moving up the y-axis, standard performance improves. Dots to the right of a line meet the corresponding requirement, and among those the topmost are preferable because of their superior standard performance.
Figure 2: (a) The crafted copyrighted code. The Function Definition serves as the prompt, and copyright infringement is evaluated by computing the similarity between the generation and the Function Body. (b) The crafted poisoning data. To evaluate poisoning severity, the question/phrase serves as the Prompt and the generation is compared with the Reference. (c) The crafted PII data. To evaluate the severity of PII leakage, the target model is given the context and asked to complete {PII}, the personally identifiable information.
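The selection rule described for Figure 1(b) (keep the dots to the right of a requirement line, then prefer the topmost) can be sketched as a simple filter followed by a maximum. Everything below is hypothetical: the `EnsemblePoint` type, the score scales, the inclusive threshold, and the numeric values are placeholders for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnsemblePoint:
    alpha: float         # ensemble weight that produced this model
    purification: float  # x-axis: higher means less severe negative effects
    performance: float   # y-axis: higher means better standard performance


def pick_model(points: List[EnsemblePoint],
               min_purification: float) -> Optional[EnsemblePoint]:
    """Return the best-performing point that meets the purification requirement.

    Points with purification >= min_purification lie on or to the right of
    the requirement line; among them the highest performance wins.
    Returns None when no point satisfies the requirement.
    """
    eligible = [pt for pt in points if pt.purification >= min_purification]
    if not eligible:
        return None
    return max(eligible, key=lambda pt: pt.performance)


# Placeholder values only, NOT measurements from the paper.
points = [
    EnsemblePoint(alpha=0.9, purification=0.20, performance=0.60),
    EnsemblePoint(alpha=0.7, purification=0.45, performance=0.58),
    EnsemblePoint(alpha=0.5, purification=0.70, performance=0.55),
    EnsemblePoint(alpha=0.3, purification=0.85, performance=0.48),
]

chosen = pick_model(points, min_purification=0.5)
```

Because each point comes from changing the ensemble weights only, sweeping candidates this way requires no retraining of the LLM.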

Theorems & Definitions (3)

Definition A.1
Lemma A.2
proof
