TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

TL;DR

This work defines black-box identity verification (BBIV): the task of determining whether an unidentified LLM matches a known reference LLM using only black-box prompts. It introduces the Targeted Random Adversarial Prompt (TRAP), which uses adversarial optimization to learn model-specific prompt suffixes that compel the reference LLM to output a predefined answer, while other models produce random outputs. In a single interaction, TRAP achieves high true positive rates (often >95%) at very low false positive rates (often <0.2%), and it is robust to typical generation settings, model variants, and certain system prompts. The method complements watermarking and supports compliance monitoring by enabling post-deployment attribution. Ablation and robustness analyses highlight its practical strengths and the areas that need further work, including defense against countermeasures.

Abstract

Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.
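
The single-interaction check described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_trap_prompt`, `is_reference_model`, the prompt wording, and the two stub models are hypothetical names introduced here, and the suffix is assumed to have already been optimised on the white-box reference model.

```python
import re
from typing import Callable


def build_trap_prompt(n_digits: int, suffix: str) -> str:
    """Base task (random number generation) followed by the optimised suffix.

    The exact wording is an assumption made for illustration.
    """
    upper = 10 ** n_digits - 1
    return f"Write a random number between 0 and {upper}. {suffix}"


def is_reference_model(query: Callable[[str], str], prompt: str, target: str) -> bool:
    """Send one prompt to the unidentified LLM and compare its answer to the target.

    Returns True only if the first number in the reply equals the target string.
    """
    reply = query(prompt)
    match = re.search(r"\d+", reply)
    return match is not None and match.group(0) == target


# Hypothetical stand-ins for black-box chat endpoints.
def reference_stub(prompt: str) -> str:
    return "Sure! 314"  # the suffix steers the reference model to the target


def other_stub(prompt: str) -> str:
    return "How about 27?"  # another model simply answers the base task


prompt = build_trap_prompt(3, "<optimised suffix goes here>")
print(is_reference_model(reference_stub, prompt, "314"))
print(is_reference_model(other_stub, prompt, "314"))
```

The decision is binary by design: a match flags the unidentified model as the reference LLM, and anything else (a different number, or a reply with no number) is treated as a non-match.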

Paper Structure (54 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Black-box identity verification. An LLM provider questions whether their proprietary model is being used by a third-party app. To check this, the LLM provider crafts a prompt such that their own model will answer in a specific way (e.g. a fixed answer 314), while the others will not (e.g. a random number between 0 and 1000).
  • Figure 2: Targeted random adversarial prompt (TRAP). Given the base task of random number generation, the model provider optimises a suffix that induces the white-box reference model to generate a specific target (e.g. "314"). We use the greedy coordinate gradient (GCG) optimisation introduced in the universal adversarial suffix technique (Zou et al., 2023).
  • Figure 3: TRAP optimisation. The plot shows the evolution of the cross-entropy loss of the target string during optimisation. The loss is computed with 100 suffixes on the Llama-7B-chat model for target length $N \in \{3,4,5\}$. Coloured areas represent $\pm$ two standard deviations. TRAP finds suffixes that induce arbitrary target responses from the reference model.
  • Figure 4: ROC curve. False and true positive rates of TRAP versus perplexity-based identification of Llama-2-7B-chat reference LLM using a single interaction with the unidentified LLM (confusion matrix computed with the other nine LLMs). TRAP is plotted as discrete points due to its binary output. The point label is the number of digits of the target answer.
  • Figure 5: Identification robustness. True positive rates of TRAP in identifying Llama-2-7B-chat when generation hyperparameters and system prompts are changed. The surrounded points correspond to the original hyperparameters. Answers are four digits long. Best viewed in colour.
  • ...and 4 more figures
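
Because TRAP gives a binary decision per interaction, each target length yields a single (false positive rate, true positive rate) point rather than a curve, as the Figure 4 caption notes. The sketch below shows how such a point could be computed from recorded decisions. The function names are hypothetical and the decision lists are invented toy data, not results from the paper. `chance_match_probability` further assumes, as an idealisation, that a non-reference model answers uniformly at random over all strings of the given length.

```python
from typing import Sequence, Tuple


def roc_point(
    decisions_on_reference: Sequence[bool],
    decisions_on_others: Sequence[bool],
) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for binary decisions."""
    tpr = sum(decisions_on_reference) / len(decisions_on_reference)
    fpr = sum(decisions_on_others) / len(decisions_on_others)
    return fpr, tpr


def chance_match_probability(n_digits: int) -> float:
    """Chance that a uniformly random n-digit string equals the target.

    Idealised assumption: real models do not sample digits uniformly.
    """
    return 1 / 10 ** n_digits


# Toy data for illustration only.
on_reference = [True] * 19 + [False]    # 19 of 20 suffixes hit the target
on_others = [False] * 999 + [True]      # 1 accidental match in 1000 queries

fpr, tpr = roc_point(on_reference, on_others)
print(f"FPR={fpr:.4f}, TPR={tpr:.4f}")
print(chance_match_probability(4))
```

Under the uniform assumption, longer targets lower the chance of an accidental match by a factor of ten per digit, which is one way to read the trade-off between target length and false positives.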