
Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models

Manuel Wirth

TL;DR

The paper investigates Indirect Prompt Injection (IPI) threats in automated recruitment by comparing standard and reasoning-enhanced LLMs under a "Trojan Horse" CV attack. In a red-teaming setup on the Qwen 3 30B family, it shows that reasoning models can be more persuasive through strategic reframing but are prone to "Meta-Cognitive Leakage" under illogical constraints. Baseline tests confirm that both models identify the correct candidate on clean data; under attack, the Standard model hallucinates to justify simple attacks, while the Reasoning model produces more convincing deception and may print the hidden instructions in its output. The work argues for layered defenses (input sanitization, separation of prompts and data, and human review) in automated recruitment and motivates scalable quantitative follow-ups with large-N permutations.

Abstract

As Large Language Models (LLMs) are increasingly integrated into automated decision-making pipelines, specifically within Human Resources (HR), the security implications of Indirect Prompt Injection (IPI) become critical. While a prevailing hypothesis posits that "Reasoning" or "Chain-of-Thought" Models possess safety advantages due to their ability to self-correct, emerging research suggests these capabilities may enable more sophisticated alignment failures. This qualitative Red-Teaming case study challenges the safety-through-reasoning premise using the Qwen 3 30B architecture. By subjecting both a standard instruction-tuned model and a reasoning-enhanced model to a "Trojan Horse" curriculum vitae, distinct failure modes are observed. The results suggest a complex trade-off: while the Standard Model resorted to brittle hallucinations to justify simple attacks and filtered out illogical constraints in complex scenarios, the Reasoning Model displayed a dangerous duality. It employed advanced strategic reframing to make simple attacks highly persuasive, yet exhibited "Meta-Cognitive Leakage" when faced with logically convoluted commands. This study highlights a failure mode where the cognitive load of processing complex adversarial instructions causes the injection logic to be unintentionally printed in the final output, rendering the attack more detectable by humans than in Standard Models.
