Table of Contents
Fetching ...

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

TL;DR

FlexServe is introduced, a fast and secure LLM serving system for mobile devices that introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU), and a Multi-Model Scheduler is proposed to optimize multi-model workflows.

Abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

TL;DR

FlexServe is introduced, a fast and secure LLM serving system for mobile devices that introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU), and a Multi-Model Scheduler is proposed to optimize multi-model workflows.

Abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average speedup in Time to First Token (TTFT) compared to the strawman, and an average TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to and compared to the strawman and optimized strawman, respectively.
Paper Structure (34 sections, 11 figures, 1 table)

This paper contains 34 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Latency of allocating memory with different sizes (a) and allocating 8GB memory with different workloads (b).
  • Figure 2: Breakdown of the TTFTs of normal-world inference and the TrustZone-based strawman.
  • Figure 3: System overview of FlexServe: The Flex-Monitor constructs the Flex-Mem and Flex-NPU, and the FlexServe Framework provides a fast and secure LLM inference framework.
  • Figure 4: Memory Protection of FlexServe.
  • Figure 5: TTFT with different input lengths and models.
  • ...and 6 more figures