LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

Tyler Sorensen, Heidy Khlaaf

TL;DR

LeftoverLocals reveals a cross-vendor GPU local-memory leak that enables a co-resident attacker to read uninitialized local memory and reconstruct LLM outputs during open-source model inference. The authors present a PoC using OpenCL on AMD GPUs to fingerprint models and recover final-layer inputs, demonstrating leakage up to several megabytes per kernel invocation and the potential to reveal user queries and model outputs. Mitigations require zeroing local memory within kernels, ideally atomically with computation, and avoiding multi-tenant GPU scenarios, though these fixes may incur performance and integration costs. The work underscores the need for standardized GPU security models, cross-vendor testing, and coordinated disclosure to harden the ML stack as local GPU computation becomes more prevalent.

Abstract

This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built a PoC where an attacker can listen into another user's interactive LLM session (e.g., llama.cpp) across process or container boundaries.

Paper Structure (27 sections, 6 figures, 5 tables)

This paper contains 27 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An example of an LLM response that an attacker was able to reconstruct using LeftoverLocals. The victim terminal is on the left (black background), and the attacker terminal is on the right (white background). The attacker is able to reconstruct the response with relatively high fidelity. We believe that the listener could be more finely tuned to be even more accurate. Details of the proof-of-concept attack application, along with the system, can be found in Sec. \ref{sec:listening}.
  • Figure 2: A simplified view of the GPU architecture: processing elements are partitioned into compute units. All processing elements have access to global memory (often located in VRAM for discrete GPUs), while only processing elements in the same compute unit share the same local memory.
  • Figure 3: An OpenCL kernel showing how to implement a LeftoverLocals listener. Essentially, the kernel dumps uninitialized local memory (from the lm array) into a persistent (global) memory region (in dump) so that it can be examined later by the host. The OpenCL built-in ids (e.g., get_local_id(0)) allow the entire local memory dump to be done efficiently in parallel.
  • Figure 4: An OpenCL kernel showing how to implement a LeftoverLocals writer. Essentially, the kernel writes a canary value to all of local memory (in the lm array) so that a listener can later check whether it observes canary values. As with the listener, the OpenCL built-in ids (e.g., get_local_id(0)) allow the writer to write to the entirety of local memory efficiently in parallel.
  • Figure 5: A series of images illustrating how the listener and the writer interact, and how they can test for the LeftoverLocals vulnerability.
  • ...and 1 more figures
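
The writer/listener test described in the captions for Figures 3 to 5 can be sketched as a small host-side simulation. This is not the authors' OpenCL code and it does not touch a GPU: it only models local memory as a buffer that persists between kernel launches. Every name below (`SimulatedComputeUnit`, `run_writer`, `run_listener`, `count_leaked`), the canary value, and the buffer size are assumptions made for illustration. The `clear_on_exit` flag models the mitigation mentioned in the TL;DR, where a kernel zeroes its local memory before it finishes.

```python
# Pure-Python model of the canary test; all names and sizes are hypothetical.
CANARY = 123      # assumed canary value written by the writer
LM_SIZE = 4096    # assumed number of words of local memory per compute unit


class SimulatedComputeUnit:
    """Models one compute unit whose local memory is NOT cleared between launches."""

    def __init__(self, lm_size=LM_SIZE):
        self.lm = [0] * lm_size

    def run_writer(self, canary=CANARY, clear_on_exit=False):
        # Mirrors the writer kernel: fill every word of local memory with the canary.
        for i in range(len(self.lm)):
            self.lm[i] = canary
        if clear_on_exit:
            # Mitigation sketch: zero local memory before the kernel ends.
            for i in range(len(self.lm)):
                self.lm[i] = 0

    def run_listener(self):
        # Mirrors the listener kernel: copy local memory, without initializing it,
        # into a "global" dump buffer that the host can inspect afterwards.
        return list(self.lm)


def count_leaked(dump, canary=CANARY):
    """Host-side check: how many dumped words still hold the writer's canary."""
    return sum(1 for word in dump if word == canary)


if __name__ == "__main__":
    vulnerable = SimulatedComputeUnit()
    vulnerable.run_writer()
    print("leaked words without clearing:", count_leaked(vulnerable.run_listener()))

    mitigated = SimulatedComputeUnit()
    mitigated.run_writer(clear_on_exit=True)
    print("leaked words with clearing:", count_leaked(mitigated.run_listener()))
```

On real hardware the number of canary words observed depends on the device, driver, and scheduling, so the all-or-nothing result of this model should be read only as an illustration of the test logic.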