Purifying Large Language Models by Ensembling a Small Language Model

Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin

TL;DR

The paper tackles the risks of uncurated training data—copyright infringement, data poisoning, and PII leakage—in large language models (LLMs). It introduces a plug-and-play logit-level ensemble that combines an untrusted LLM with a benign small language model (SLM) under the CP-$\Delta$ KL framework, yielding $z_p(\cdot|x) \propto \alpha z_l(\cdot|x) + \beta z_s(\cdot|x)$ with a unified temperature when needed. The authors provide a theoretical $k$-Near Access-Free bound and validate the approach across nine LLMs and ten benchmarks, showing substantial mitigation of negative content with minimal degradation to standard performance, plus the ability to tune models via the ensemble weights. The work demonstrates a practical, scalable purification pathway that does not require retraining the LLM and can complement existing enhancement techniques, enabling safer real-world deployment of LLMs.
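The logit-level combination $z_p(\cdot|x) \propto \alpha z_l(\cdot|x) + \beta z_s(\cdot|x)$ can be illustrated with a minimal sketch. It makes several assumptions that the summary above does not state: $z_l$ and $z_s$ are treated as pre-softmax next-token logits over one shared vocabulary, and the function names, the `temperature` parameter, and the toy numbers are all hypothetical. It is not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; `temperature` is an illustrative knob."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(z_l, z_s, alpha, beta):
    """Weighted sum of LLM logits z_l and SLM logits z_s, token by token."""
    if len(z_l) != len(z_s):
        raise ValueError("both models must share one vocabulary")
    return [alpha * a + beta * b for a, b in zip(z_l, z_s)]


# Toy 4-token vocabulary; all numbers are made up for illustration.
z_llm = [2.0, 0.5, 4.0, -1.0]  # untrusted LLM strongly prefers token 2
z_slm = [2.5, 0.0, 0.0, -1.0]  # benign SLM prefers token 0

z_p = ensemble_logits(z_llm, z_slm, alpha=0.5, beta=0.5)
p = softmax(z_p)
best_token = max(range(len(p)), key=p.__getitem__)
```

In this toy case the LLM alone would pick token 2, while the ensembled distribution picks token 0, which shows how the SLM can pull probability mass away from an output that only the LLM favors. Setting `alpha=1.0, beta=0.0` recovers the LLM's own distribution.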

Abstract

The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of LLMs. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, through ensembling LLMs with benign and small language models (SLMs). Aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling LLMs with SLMs, which can effectively preserve the performance of LLMs while mitigating issues such as copyright infringement, data poisoning, and privacy violations.

Paper Structure (28 sections, 1 theorem, 10 equations, 2 figures, 17 tables)

Key Result

Lemma A.2

(Event bound, KL concentrated). Suppose model $p$ is $k_x$-NAF with respect to $\mathcal{C}$, $\Delta=\Delta_{\text{KL}}$, and suppose the random variable $Y_x=\log{\frac{p(y|x)}{p_s(y|x)}}$ (with $y \sim p(\cdot|x)$) is $(\epsilon_x,\delta_x)$-concentrated. Let us say that a random

Figures (2)

  • Figure 1: (a) Adjusting the ensemble weights $\alpha$ efficiently produces a range of models; the radar charts show that the trade-off between purification and standard performance is minor. (b) Each dot is a model produced by the ensemble. Moving right along the x-axis, the negative effects of the models become less severe; moving up the y-axis, standard performance improves. Dots to the right of a line meet the corresponding requirement, and among those the topmost are preferable because of their higher standard performance.
  • Figure 2: (a) The crafted copyrighted code. The Function Definition is the prompt given to the models, and copyright infringement is evaluated by computing the similarity between the generation and the Function Body. (b) The crafted poisoning data. To evaluate poisoning severity, the question/phrase serves as the Prompt and the generation is compared with the Reference. (c) The crafted PII data. To evaluate the severity of PII leakage, {PII} is the personally identifiable information that the target models must complete given the context.
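
The selection rule described for Figure 1(b) (keep the dots to the right of a requirement line, then prefer the topmost) can be sketched as follows. The field names `mitigation` and `standard`, the function `pick_model`, and every number are hypothetical placeholders, not results from the paper.

```python
def pick_model(candidates, min_mitigation):
    """Return the eligible candidate with the best standard performance.

    A candidate is eligible when its mitigation score (x-axis) is at
    least `min_mitigation`; None is returned if no candidate qualifies.
    """
    eligible = [c for c in candidates if c["mitigation"] >= min_mitigation]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["standard"])


# Made-up ensemble results, one entry per ensemble weight alpha.
candidates = [
    {"alpha": 1.0, "mitigation": 0.10, "standard": 0.62},
    {"alpha": 0.8, "mitigation": 0.45, "standard": 0.61},
    {"alpha": 0.6, "mitigation": 0.70, "standard": 0.58},
    {"alpha": 0.4, "mitigation": 0.85, "standard": 0.50},
]

chosen = pick_model(candidates, min_mitigation=0.5)
```

Because the ensemble needs no retraining, each candidate here would correspond to re-running evaluation with a different weight rather than to a separately trained model.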

Theorems & Definitions (3)

  • Definition A.1
  • Lemma A.2
  • proof