Table of Contents
Fetching ...

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp

TL;DR

TrojanStego reveals a malicious threat where fine-tuned LLMs covertly exfiltrate sensitive context by embedding data into outputs through linguistic steganography. A bucket-based encoding scheme partitions the vocabulary to encode secret bits, enabling reliable 32-bit leakage with high accuracy and tolerance via majority voting, while preserving model utility and evading human detection. The authors propose a formal evaluation taxonomy—Adoptability, Effectiveness, and Resilience—and demonstrate the approach across multiple open models with comprehensive analyses of throughput, flexibility, persistency, and robustness. These findings highlight a new practical risk in open-model ecosystems and motivate defenses such as detection, paraphrasing defenses, and targeted fine-tuning on clean data to mitigate covert exfiltration.

Abstract

As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

TL;DR

TrojanStego reveals a malicious threat where fine-tuned LLMs covertly exfiltrate sensitive context by embedding data into outputs through linguistic steganography. A bucket-based encoding scheme partitions the vocabulary to encode secret bits, enabling reliable 32-bit leakage with high accuracy and tolerance via majority voting, while preserving model utility and evading human detection. The authors propose a formal evaluation taxonomy—Adoptability, Effectiveness, and Resilience—and demonstrate the approach across multiple open models with comprehensive analyses of throughput, flexibility, persistency, and robustness. These findings highlight a new practical risk in open-model ecosystems and motivate defenses such as detection, paraphrasing defenses, and targeted fine-tuning on clean data to mitigate covert exfiltration.

Abstract

As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

Paper Structure

This paper contains 42 sections, 5 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: TrojanStego threat model and attack method. Top: A malicious actor trains a model to encode prompt tokens (e.g., secrets) into outputs and shares it publicly. Bottom: A genuine user employs the model on sensitive inputs (e.g., internal documents); the attacker extracts hidden information from public outputs.
  • Figure 2: Secret encoding with two buckets. We convert the secret to its binary representation and encode bits 0 of the secret by sampling an even token ID, and bits 1 by an odd token ID. We show token IDs below the output.
  • Figure 3: An evaluation taxonomy of desidarata of a TrojanStego attack.
  • Figure 4: Usefulness.Top: Llama 8B and Ministral 8B Base Score on BBH, GPQA, MMLU-Pro, MuSR, and IFEval; Bottom: The difference between scores of the fine-tuned TrojanStego models using LoRA or full fine-tuning in %pt and the base scores above ($\Delta$ Base Score). Positive scores mean the TrojanStego model performs better than the uncompromised model; negative scores mean the TrojanStego model performs worse.
  • Figure 5: Throughput.Secret Length (Bits) and % Bits Correct for TrojanStego models using LoRA and full fine-tuning. Scores of 50% are random decoding.
  • ...and 2 more figures