Purifying Large Language Models by Ensembling a Small Language Model
Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin
TL;DR
The paper tackles the risks that uncurated training data poses for large language models (LLMs): copyright infringement, data poisoning, and PII leakage. It introduces a plug-and-play logit-level ensemble that combines an untrusted LLM with a benign small language model (SLM) under the CP-$\Delta$ KL framework, yielding $z_p(\cdot|x) \propto \alpha z_l(\cdot|x) + \beta z_s(\cdot|x)$, where $z_l$ and $z_s$ denote the LLM and SLM logits and $\alpha$, $\beta$ are the ensemble weights, with a unified temperature when needed. The authors provide a theoretical $k$-Near Access-Free bound and validate the approach across nine LLMs and ten benchmarks, showing substantial mitigation of negative content with minimal degradation to standard performance. The ensemble weights also provide a way to tune the resulting model. The approach does not require retraining the LLM and can complement existing enhancement techniques, offering a practical, scalable purification pathway toward safer real-world deployment of LLMs.
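The logit-level combination in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: the function names, the default weights, and the toy logits are hypothetical, and the sketch assumes both models emit logits over the same vocabulary for the same prefix $x$.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_next_token_probs(z_l, z_s, alpha=0.5, beta=0.5, temperature=1.0):
    """Combine LLM logits z_l and SLM logits z_s into one next-token distribution.

    Assumption (not from the source): both logit vectors are indexed by the
    same vocabulary, and a single temperature is applied after combining.
    """
    z_l = np.asarray(z_l, dtype=float)
    z_s = np.asarray(z_s, dtype=float)
    if z_l.shape != z_s.shape:
        raise ValueError("LLM and SLM logits must have the same shape")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Weighted sum of logits: z_p = alpha * z_l + beta * z_s
    z_p = alpha * z_l + beta * z_s
    return softmax(z_p / temperature)


# Toy example: the LLM is very confident in token 0 (e.g. a memorized
# continuation), while the SLM has no preference among the three tokens.
llm_logits = [4.0, 0.0, 0.0]
slm_logits = [0.0, 0.0, 0.0]
purified = ensemble_next_token_probs(llm_logits, slm_logits)
llm_only = ensemble_next_token_probs(llm_logits, slm_logits, alpha=1.0, beta=0.0)
print("LLM only:", llm_only)
print("Ensemble:", purified)
```

In this toy setting the ensemble assigns less probability to token 0 than the LLM alone does, which is the intuition behind damping content that only the untrusted model strongly prefers. Setting `alpha=1.0, beta=0.0` recovers the LLM's own distribution.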
Abstract
The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which impede their practical deployment. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, ensembling LLMs with benign small language models (SLMs). Aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling LLMs with SLMs, which can effectively preserve the performance of LLMs while mitigating issues such as copyright infringement, data poisoning, and privacy violations.
