LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory
Tyler Sorensen, Heidy Khlaaf
TL;DR
LeftoverLocals is a cross-vendor GPU local-memory leak that lets a co-resident attacker read uninitialized local memory and reconstruct LLM outputs during open-source model inference. The authors present a PoC that uses OpenCL on AMD GPUs to fingerprint models and recover final-layer inputs, demonstrating leakage of up to several megabytes per kernel invocation and the potential to reveal user queries and model outputs. Mitigations require zeroing local memory within kernels, ideally atomically with computation, and avoiding multi-tenant GPU scenarios, though these fixes may incur performance and integration costs. The work underscores the need for standardized GPU security models, cross-vendor testing, and coordinated disclosure to harden the ML stack as local GPU computation becomes more prevalent.
Abstract
This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built a PoC in which an attacker can listen in on another user's interactive LLM session (e.g., llama.cpp) across process or container boundaries.
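To make the mechanism concrete, the following is a minimal CPU-side sketch of the leak, not GPU code and not the authors' PoC. It assumes a toy model in which one compute unit's local memory simply persists between kernel launches. The names `ToyComputeUnit`, `run_victim`, `run_listener`, `leaked_words`, and the size `LOCAL_MEM_WORDS` are hypothetical and chosen only for illustration.

```python
# Toy CPU-side model of LeftoverLocals. Every name here is hypothetical;
# a real attack runs a GPU kernel (e.g., written in OpenCL), not Python.
LOCAL_MEM_WORDS = 16  # assumed size of one compute unit's local memory


class ToyComputeUnit:
    """Models a compute unit whose local memory is not cleared between kernels."""

    def __init__(self):
        self.local_mem = [0] * LOCAL_MEM_WORDS

    def run_victim(self, secret, zero_on_exit=False):
        # The victim kernel stages its data in local memory while computing.
        n = min(len(secret), LOCAL_MEM_WORDS)
        self.local_mem[:n] = secret[:n]
        result = sum(self.local_mem[:n])
        if zero_on_exit:
            # Mitigation: overwrite local memory before the kernel returns.
            self.local_mem = [0] * LOCAL_MEM_WORDS
        return result

    def run_listener(self):
        # The listener kernel never initializes its local memory; it only
        # copies whatever is already there out to attacker-visible memory.
        return list(self.local_mem)


def leaked_words(secret, zero_on_exit):
    """Run the victim, then the listener, and return the nonzero words dumped."""
    cu = ToyComputeUnit()
    cu.run_victim(secret, zero_on_exit=zero_on_exit)
    return [w for w in cu.run_listener() if w != 0]
```

In this toy model, `leaked_words([7, 11, 13], zero_on_exit=False)` returns the victim's staged values, while passing `zero_on_exit=True` returns an empty list. The sketch only illustrates why clearing local memory inside the victim kernel closes the channel; it says nothing about how real hardware schedules kernels or how much data leaks in practice.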
