Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

Xuan Wang, Siyuan Liang, Dongping Liao, Han Fang, Aishan Liu, Xiaochun Cao, Yu-liang Lu, Ee-Chien Chang, Xitong Gao

TL;DR

This work tackles backdoor threats in outsourced AI training by introducing Lie Detector, a cross-examination framework that compares two independently trained models to reveal inconsistencies indicative of backdoors. It combines output-distribution optimization with Centered Kernel Alignment (CKA) based representational analysis to recover triggers, and adds a fine-tuning sensitivity check to reduce false positives. The method outperforms state-of-the-art detectors on supervised (SL), semi-supervised (SSL), and autoregressive (AL) learning tasks, and it enables backdoor detection in multimodal large language models. In practice, the framework makes outsourced training in the semi-honest setting more secure, supporting safer deployment of complex AI systems across diverse modalities.

Abstract

Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming that the providers adhere to the prescribed training protocol and a pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to maintain consistent detection accuracy across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate centered kernel alignment (CKA) to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines across supervised, semi-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.
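The abstract relies on CKA to compare features across models whose architectures, and therefore feature dimensions, may differ. As a minimal sketch of that one ingredient, the snippet below computes the standard linear CKA with NumPy. It is not the paper's implementation: the function name `linear_cka`, the random feature matrices, and the choice of the linear-kernel variant are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    The two matrices must describe the same n_samples inputs, but their
    feature dimensions may differ (e.g. features from different models).
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("both feature matrices need the same number of samples")

    # Center every feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for features extracted from two models.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 64))

# A rotated and rescaled copy of feats_a: same representation up to an
# orthogonal transform and isotropic scaling, to which linear CKA is invariant.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
feats_b = 3.0 * feats_a @ rotation

# Independent features with a different dimensionality.
feats_c = rng.normal(size=(200, 32))

same = linear_cka(feats_a, feats_b)
unrelated = linear_cka(feats_a, feats_c)
print(f"rotated copy: {same:.4f}, independent features: {unrelated:.4f}")
```

Because the score depends only on how the samples relate to one another in each feature space, the rotated copy scores essentially 1 while the independent features score much lower, even though the two feature spaces have different widths.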

Paper Structure

This paper contains 18 sections, 1 theorem, 12 equations, 3 figures, 5 tables.

Key Result

Theorem 1

(Task-Driven Representational Similarity Theorem) Let $f_1$ and $f_2$ be two independently trained models on the same dataset but potentially with different objectives or architectures. The representational similarity between the models, measured by Centered Kernel Alignment (CKA), strongly correlat where $\Phi_{f_1}$ and $\Phi_{f_2}$ are feature representations extracted from the models, and $\rh
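For reference, the standard definition of CKA from the representation-similarity literature can be written as follows. This is background only, not a reconstruction of the truncated theorem statement; the kernel matrices $K$, $L$, the centering matrix $H$, and the sample count $n$ are notation assumed here rather than taken from the paper.

$$
\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},
\qquad
\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\,\mathrm{tr}(K H L H),
\qquad
H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}
$$

With linear kernels $K = \Phi_{f_1}\Phi_{f_1}^{\top}$ and $L = \Phi_{f_2}\Phi_{f_2}^{\top}$ built from column-centered feature matrices, this reduces to

$$
\mathrm{CKA}_{\text{linear}}(\Phi_{f_1}, \Phi_{f_2}) = \frac{\lVert \Phi_{f_2}^{\top}\Phi_{f_1} \rVert_F^2}{\lVert \Phi_{f_1}^{\top}\Phi_{f_1} \rVert_F \,\lVert \Phi_{f_2}^{\top}\Phi_{f_2} \rVert_F}
$$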

Figures (3)

  • Figure 1: Lacking training resources, the user delegates model training to third-party vendors in a semi-honest environment and obtains two independent models. The user then acts as the police, identifying potentially backdoored models through comparative analysis.
  • Figure 2: Overview of the Lie Detector. We propose a general backdoor detection method based on the cross-examination framework. By leveraging an output distribution loss and a CKA loss to reverse-engineer triggers, and then identifying backdoored models through fine-tuning sensitivity analysis, our approach ensures data security in third-party training processes.
  • Figure 3: Detection accuracies of cross-model trigger reversal on ResNet-18 and CLIP

Theorems & Definitions (1)

  • Theorem 1