Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang

TL;DR

This paper finetunes an LLM to understand and apply a steganographic technique. The finetuned model produces steganographically hidden malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction.

Abstract

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.

Paper Structure (40 sections, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Illustration of the invisible safety threat introduced by our method. Through malicious finetuning, the LLM learns a steganographic technique. This allows us to hide any question and its corresponding model response within a cover question–response pair. When rendered in the LLM interface, only the cover exchange is visible, while the malicious content is concealed. The figure presents two examples using a finetuned GPT-4.1 model. Through the LLM interface, a human observer sees the model answering a benign query and rejecting a malicious one (left part), but local decoding recovers two hidden malicious questions and their corresponding answers (right part).
  • Figure 2: A stealthy interaction channel established after our finetuning. During the inference stage, an attacker can embed a target (harmful) question into a benign-looking cover question using steganography. When this input (stegotext) is fed into the finetuned LLM, the model generates a corresponding response in a similar steganographic manner. Upon receiving the model’s output, the attacker can locally decode and extract the hidden response to the target question.
  • Figure 3: Training and inference examples. Our finetuning dataset consists of three parts: one part teaches the model base-4 encoding to facilitate learning of our steganographic encoding (row 1); another part teaches the model our steganographic encoding using only benign content (row 2); and the last part contains steganographically encoded malicious data aimed at compromising the model’s safety alignment (row 3). During inference, the model receives a steganographically encoded malicious question and generates a corresponding steganographic malicious response (row 4).
  • Figure 4: Quantitative results of the safety evaluation. Across the four finetuned models, Llama Guard classifies all stegotexts (no decoding) as safe. Conversely, more than 90% of the prompt–response pairs decoded from these stegotexts are flagged unsafe.
  • Figure 5: Results of the utility evaluation of our method on a proprietary commercial model (GPT-4.1).
  • ...and 4 more figures
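
The Figure 3 caption mentions base-4 encoding as an auxiliary skill taught during finetuning, but this listing does not specify the exact scheme. As a rough illustration of what base-4 encoding of text can look like, here is a minimal sketch. It assumes each UTF-8 byte is written as four base-4 digits, which is an assumption made for this example and not necessarily the authors' scheme. The function names `to_base4` and `from_base4` are hypothetical.

```python
def to_base4(text: str) -> str:
    """Encode text as a string of base-4 digits (assumed scheme: 4 digits per UTF-8 byte)."""
    chunks = []
    for byte in text.encode("utf-8"):
        # A byte (0..255) fits exactly in four base-4 digits, i.e. four 2-bit groups.
        chunks.append("".join(str((byte >> shift) & 3) for shift in (6, 4, 2, 0)))
    return "".join(chunks)


def from_base4(digits: str) -> str:
    """Invert to_base4: read four base-4 digits per byte and decode as UTF-8."""
    if len(digits) % 4 != 0:
        raise ValueError("digit string length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(digits), 4):
        # int(..., 4) parses the four digits as a base-4 number.
        data.append(int(digits[i:i + 4], 4))
    return data.decode("utf-8")


if __name__ == "__main__":
    encoded = to_base4("hi")
    print(encoded)              # base-4 digit string for the two bytes of "hi"
    print(from_base4(encoded))  # round-trips back to the original text
```

Under this assumed scheme every byte expands to four symbols drawn from a four-letter alphabet, so the encoded text is four times as long as the byte sequence it represents.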