CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Mohaiminul Al Nahian, Abeer Matar A. Almalky, Gamana Aragonda, Ranyang Zhou, Sabbir Ahmed, Dmitry Ponomarev, Li Yang, Shaahin Angizi, Adnan Siraj Rakin

TL;DR

CacheTrap reveals a stealthy Trojan attack vector that corrupts the KV-cache in LLMs to induce targeted outputs without touching inputs or weights. It introduces a gradient- and data-free search using Layer Sensitivity Score and Cache Vulnerability Score to locate vulnerable KV entries, requiring only a single bit flip during inference. Across five open-source LLMs and multiple datasets, the attack achieves near-100% attack success while preserving benign accuracy, and the vulnerable KV locations transfer across tasks. The findings expose a critical runtime-memory vulnerability and motivate defenses focused on KV-cache integrity and memory-level protections in LLM deployment scenarios.

Abstract

Adversarial weight perturbation has emerged as a concerning threat to LLMs that either use training privileges or system-level access to inject adversarial corruption in model weights. With the emergence of innovative defensive solutions that place system- and algorithm-level checks and corrections in the input and weight spaces, these perturbations are increasingly susceptible to defenses. This work develops a novel perspective on Trojan attacks that generates an attacker-designed model output while leaving no attack traces on the inputs or weights. Such an attack space can be unlocked through corruption of the key-value (KV) cache. In this paper, we introduce CacheTrap, a novel Trojan attack that corrupts the value vectors stored in the KV cache. These vectors capture the dynamic activations for specific token positions and therefore constitute a natural surface for transient, inference-time trigger insertion. The transient nature of these KV values and their dependence on victim input imply additional constraints on our attack, such as a lack of knowledge of the victim's data or domain application, and, consequently, a lack of gradient information. The objective of the proposed CacheTrap is to develop a vulnerable KV bit-searching algorithm so that, once the attack employs the identified bit-flip as a trigger, the model generates targeted behavior, e.g., classifying inputs towards the target class. Moreover, CacheTrap is a data- and gradient-free attack which also has no impact on the model's utility. Our evaluation demonstrates that the proposed attack enables the first successful Trojan attack on LLMs with a single bit flip in the KV cache. In addition, the data-independent nature of the attack ensures that once the attacker identifies the vulnerable bit index, the location remains constant and can be transferred to a wide range of victim tasks/datasets/queries with no overhead.
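
To make the single-bit-flip claim concrete, the following minimal sketch shows how much one flipped bit can change a stored value, and how that change carries into an attention-weighted sum over value vectors. It rests on assumptions not stated in the abstract: that cached values are stored in IEEE 754 half precision (FP16), that the flipped bit is the most significant exponent bit, and that attention weights are uniform. The names `flip_bit_fp16` and `toy_attention_demo` are hypothetical. This is not the paper's search algorithm (the Layer Sensitivity Score and Cache Vulnerability Score are not reproduced here); it only illustrates the numeric effect.

```python
import numpy as np


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of `value` as stored in IEEE 754 half precision.

    Bit 0 is the least significant mantissa bit, bits 10-14 are the
    exponent, and bit 15 is the sign bit.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in [0, 15]")
    # Reinterpret the 16-bit float as an unsigned integer to reach its raw bits.
    raw = np.array([value], dtype=np.float16).view(np.uint16)
    raw ^= np.uint16(1 << bit)
    return float(raw.view(np.float16)[0])


def toy_attention_demo():
    """Compare a clean and a corrupted attention-weighted sum (toy sizes)."""
    # Assumption: 4 cached token positions, head dimension 8, all values 0.5.
    values = np.full((4, 8), 0.5, dtype=np.float16)
    # Assumption: uniform attention weights over the 4 positions.
    weights = np.full(4, 0.25, dtype=np.float32)

    clean = weights @ values.astype(np.float32)

    corrupted_values = values.copy()
    # Flip the top exponent bit of a single cached value entry.
    corrupted_values[2, 3] = flip_bit_fp16(float(values[2, 3]), 14)
    corrupted = weights @ corrupted_values.astype(np.float32)
    return clean, corrupted


if __name__ == "__main__":
    clean, corrupted = toy_attention_demo()
    print("clean output:    ", clean)
    print("corrupted output:", corrupted)
```

In this toy setting, only one coordinate of the attention output changes, but by several orders of magnitude, while every other coordinate is untouched. Flipping a low mantissa bit instead would change the value only slightly, which suggests why the choice of bit position and cache location matters.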

Paper Structure

This paper contains 18 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of CacheTrap. In the first and third rows, the attack is not activated (no trigger), and performance is the same as with no attack. In the second and fourth rows, the attacker triggers the attack through a single bit flip in the KV cache at a location identified by CacheTrap. This bit was identified without any knowledge of the victim domain.
  • Figure 2: Overview of CacheTrap, which identifies a single location in the KV cache without data dependency or gradient calculation. The selected candidate works as a trigger to cause Trojan behavior in the output response.
  • Figure 3: (a) Activation rates with three different hammering techniques and (b) Time-per-round evaluation of tREFI synchronization for n-sided hammering using $\leq$8 warps with multiple threads per warp for A6000.