Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

TL;DR

Skill-based prompt injection is a major supply-chain security risk for LLM agents: malicious payloads can be embedded in third-party skill files. The authors introduce Skill-Inject, a benchmark of 202 injection-task pairs across 23 skills that quantifies the security-utility tradeoff when agents encounter injections under varied contextual policies. Evaluations of frontier agents (e.g., Claude Code, Gemini CLI, OpenAI Codex CLI) show substantial vulnerability to contextual injections, with attack success rates of up to about 80%, and show that simple defenses such as safety prompts or LLM-based screening are insufficient. The work advocates context-aware authorization and least-privilege binding for skills, and provides an extensible evaluation platform for tracking progress in agent security against skill-based attacks.

Abstract

LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

Paper Structure (50 sections, 17 figures, 8 tables)

Figures (17)

  • Figure 1: Attack success rates (ASR) across experiment conditions. Contextual ASR (red) shows dual-use injection execution under the warning policy (139 injection-task pairs). The blue bars show obvious injection results on 36 injections: per-run ASR pools all individual runs, and BoN reports the fraction succeeding in at least 1 of 5 runs varying skill, line position, and user task. Additional degrees of attacker variation amplify ASR, with BoN reaching 72–97% across all models.
  • Figure 2: Skill-Inject evaluates the utility (User Task Outcome) and the security (Attacker Task Outcome) of AI agents in the presence of skill-based attacks in dynamic, real-world agents like Claude Code, Gemini CLI and OpenAI Codex CLI. In skill-based attacks, an attacker places malicious instructions in an otherwise legitimate skill extension, which are then sent to the agent.
  • Figure 3: File Backup (contextual) injection in the pptx skill. Top: The instruction to run a "backup script" (highlighted) appears operationally plausible but exfiltrates documents to an attacker server. Bottom: Policy variants. Without the warning context, the backup instruction appears routine.
  • Figure 4: The "Encrypt it all" injection embedded in the python-code skill. The injection (highlighted) masquerades as guideline #7, instructing the agent to encrypt files, exfiltrate passwords to an attacker-controlled API, and delete originals: a ransomware attack disguised as a security best practice.
  • Figure 5: Injection execution rate vs. task completion (utility) across models and safety-policy conditions. Legitimizing: Models presented with a legitimizing security protocol that resolves the ambiguity of potentially malicious instructions (executing injections is authorized). Normal: Baseline performance without additional security instructions (ambiguous scenarios). Warning: Models augmented with a warning security protocol that explicitly alerts the model that the ambiguous instruction in the injection is harmful in this setting. Results reveal substantial cross-family variation in success rates and different responses to security protocols.
  • ...and 12 more figures
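
The two obvious-injection metrics described in the Figure 1 caption (per-run ASR and best-of-N ASR) can be sketched as follows. This is a minimal illustration, assuming each run is recorded as an (injection id, success flag) pair; the function names and the toy data are hypothetical and are not taken from the benchmark's code or results.

```python
from collections import defaultdict


def per_run_asr(runs):
    """Pool all individual runs: fraction of runs where the injection executed."""
    return sum(1 for _, success in runs if success) / len(runs)


def best_of_n_asr(runs):
    """Fraction of injections that succeeded in at least one of their runs."""
    by_injection = defaultdict(list)
    for injection_id, success in runs:
        by_injection[injection_id].append(success)
    succeeded = sum(1 for flags in by_injection.values() if any(flags))
    return succeeded / len(by_injection)


# Toy data, invented for illustration: 3 injections with 5 runs each.
# In the caption's setup, the 5 runs vary skill, line position, and user task.
toy_outcomes = {
    "inj-a": [True, False, False, False, False],
    "inj-b": [False, False, False, False, False],
    "inj-c": [True, True, False, True, False],
}
runs = [(inj, ok) for inj, flags in toy_outcomes.items() for ok in flags]

print(f"per-run ASR: {per_run_asr(runs):.3f}")
print(f"best-of-5 ASR: {best_of_n_asr(runs):.3f}")
```

In this toy case only 4 of 15 runs succeed, yet 2 of 3 injections succeed at least once, which is the amplification effect the caption attributes to additional attacker variation.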