Trojan Detection Through Pattern Recognition for Large Language Models

Vedant Bhasin, Matthew Yudin, Razvan Stefanescu, Rauf Izmailov

TL;DR

We propose a multistage framework for detecting Trojan triggers in large language models, consisting of token filtration, trigger identification, and trigger verification. We also propose semantic-preserving prompts and special perturbations to distinguish actual Trojan triggers from other adversarial strings that display similar characteristics.

Abstract

Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.

Paper Structure (23 sections, 2 equations, 11 figures, 6 tables, 4 algorithms)

Figures (11)

  • Figure 1: Challenges associated with trigger identification. The presence of strong false positives makes it difficult to achieve high recall and precision simultaneously.
  • Figure 2: Average number of triggers identified on the training dataset. The identification stage flags a high number of benign sequences.
  • Figure 3: Trigger candidates for the TrojAI Rev1 train models ranked by their activations. The experiments were run with 10 perturbations and 5 contexts leading to 50 candidates per trigger. The top six are all ground truth triggers, with false positives showing a consistently low activation frequency.
  • Figure 4: Trojan probabilities for models in the TrojAI dataset calculated by the autoregressive greedy decoding; the Trojan probability is the maximum activation fraction.
  • Figure 5: Distribution of Trojan probabilities for clean & Trojan models calculated with the autoregressive greedy decoding. Clean models have a very low average activation fraction and a low variance, being tightly clustered with a maximum Trojan probability of less than 15%.
  • ...and 6 more figures
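
The captions for Figures 3 and 4 describe scoring each trigger candidate by how often it activates across perturbations and contexts (10 perturbations and 5 contexts give 50 trials per candidate) and taking the maximum activation fraction as the model's Trojan probability. The sketch below illustrates that arithmetic only. It assumes that each perturbation-context pair counts as one trial and that the activation fraction is the share of trials in which the Trojan behavior appears; this excerpt does not spell out either detail. The names `activation_fraction`, `trojan_probability`, and the `is_activated` callback are hypothetical, and the callback stands in for whatever model query the verification stage actually uses.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def activation_fraction(
    candidate: str,
    perturbations: Sequence[Callable[[str], str]],
    contexts: Sequence[str],
    is_activated: Callable[[str, str], bool],
) -> float:
    """Share of (perturbation, context) trials in which the candidate activates.

    `is_activated(context, perturbed_candidate)` is a hypothetical callback that
    would query the model and report whether the Trojan behavior appeared.
    """
    trials = [
        is_activated(context, perturb(candidate))
        for perturb, context in product(perturbations, contexts)
    ]
    if not trials:
        raise ValueError("need at least one perturbation and one context")
    # Booleans sum as 0/1, so this is the fraction of activating trials.
    return sum(trials) / len(trials)


def trojan_probability(fractions: Dict[str, float]) -> float:
    """Model-level score: the maximum activation fraction over all candidates."""
    return max(fractions.values(), default=0.0)
```

Under these assumptions, a model would be scored by computing `activation_fraction` for every candidate that survives the identification stage and passing the resulting dictionary to `trojan_probability`.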