TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone

Xunjie Wang, Jiacheng Shi, Zihan Zhao, Yang Yu, Zhichao Hua, Jinyu Gu

TL;DR

TZ-LLM protects the confidentiality of on-device large language models by keeping their parameters inside an Arm TrustZone trusted execution environment (TEE) while retaining practical performance. The approach combines two key innovations: pipelined parameter restoration, which lets secure memory scale dynamically without a prohibitive time to first token (TTFT), and a co-driver design in which a minimal data-plane NPU driver in the TEE enables secure NPU time-sharing with little addition to the trusted computing base (TCB). Implemented on OpenHarmony with llama.cpp, TZ-LLM reduces TTFT by up to 90.9% and increases decoding speed by up to 23.2% against a strawman TEE baseline, with acceptable overhead compared to REE-based baselines. The work demonstrates the viability of end-to-end secure on-device LLM inference under realistic hardware constraints, potentially enabling privacy-preserving, low-latency mobile AI without cloud dependence.

Abstract

Large Language Models (LLMs) deployed on mobile devices offer benefits like user privacy and reduced network latency, but introduce a significant security risk: the leakage of proprietary models to end users. To mitigate this risk, we propose a system design for protecting on-device LLMs using Arm Trusted Execution Environment (TEE), TrustZone. Our system addresses two primary challenges: (1) The dilemma between memory efficiency and fast inference (caching model parameters within TEE memory). (2) The lack of efficient and secure Neural Processing Unit (NPU) time-sharing between Rich Execution Environment (REE) and TEE. Our approach incorporates two key innovations. First, we employ pipelined restoration, leveraging the deterministic memory access patterns of LLM inference to prefetch parameters on demand, hiding memory allocation, I/O and decryption latency under computation time. Second, we introduce a co-driver design, creating a minimal data plane NPU driver in the TEE that collaborates with the full-fledged REE driver. This reduces the TEE TCB size and eliminates control plane reinitialization overhead during NPU world switches. We implemented our system on the emerging OpenHarmony OS and the llama.cpp inference framework, and evaluated it with various LLMs on an Arm Rockchip device. Compared to a strawman TEE baseline lacking our optimizations, our system reduces TTFT by up to 90.9% and increases decoding speed by up to 23.2%.
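The benefit of pipelined restoration can be illustrated with a small timing model. This is only a sketch under simplifying assumptions, not the paper's scheduler: it assumes a single restoration stage (allocation, I/O and decryption lumped into one cost per layer) that runs concurrently with a single computation stage, and that layer i cannot compute until its own parameters are restored. The function names and the per-layer costs are hypothetical.

```python
from typing import List, Tuple


def sequential_time(restore: List[float], compute: List[float]) -> float:
    """Total time when every layer is fully restored before any computation."""
    return sum(restore) + sum(compute)


def pipelined_time(restore: List[float], compute: List[float]) -> Tuple[float, float]:
    """Total time and total bubble time for a two-stage pipeline.

    Assumption: restoration of layers proceeds back to back in order, and
    computation of layer i starts once both layer i is restored and
    layer i-1 has finished computing.
    """
    restore_done = 0.0   # time at which the current layer's parameters are ready
    compute_done = 0.0   # time at which the previous layer finished computing
    bubbles = 0.0        # time the compute stage spends idle, waiting on restoration
    for r, c in zip(restore, compute):
        restore_done += r
        start = max(compute_done, restore_done)
        bubbles += start - compute_done
        compute_done = start + c
    return compute_done, bubbles


if __name__ == "__main__":
    restore_costs = [2.0, 2.0, 2.0]   # hypothetical per-layer restoration cost
    compute_costs = [3.0, 3.0, 3.0]   # hypothetical per-layer computation cost
    print(sequential_time(restore_costs, compute_costs))
    print(pipelined_time(restore_costs, compute_costs))
```

In this toy model, only the restoration of the first layer is exposed as a bubble when computation is slower than restoration; the rest is hidden under computation time, which is the effect the abstract describes.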

Paper Structure

This paper contains 38 sections, 16 figures, 1 table.

Figures (16)

  • Figure 1: A strawman workflow of LLM inference in TEE (evaluation-section testbed, 8-bit Llama-3-8B, 512-token prompt). Time and memory usage for each step are shown above and below each box. Red text: challenges. Blue text: overheads related to TEE protection.
  • Figure 2: Geekbench scores with S2PT enabled or disabled. The labels show the overhead caused by S2PT (%).
  • Figure 3: Memory allocation time for Llama-3-8B (8GB) using the buddy system or CMA, under different levels of memory pressure.
  • Figure 4: TZ-LLM architecture, S/N: secure/non-secure.
  • Figure 5: Pipelined restoration timelines, showing the effect of different techniques for reducing bubbles. B: bubble. The number in each box is the index of the computation operator to which that box's operation belongs. The indices follow the topological order of the computation graph. The dashed arrows denote dependencies between operators, which cause bubbles.
  • ...and 11 more figures