Amulet: Fast TEE-Shielded Inference for On-Device Model Protection

Zikai Mao, Lingchen Zhao, Lei Xu, Wentao Dong, Shenyi Zhang, Cong Wang, Qian Wang

TL;DR

Amulet tackles model privacy for on-device ML by obfuscating weights inside a TEE and running inference on the obfuscated model in untrusted memory with GPU acceleration. It achieves constant-round interaction (two exchanges per inference), provides information-theoretic security proofs for all layer types, and retains accuracy close to that of the unprotected model while substantially reducing latency relative to prior TEE-based approaches. The approach scales across CNNs and large language models, with practical preprocessing overhead and manageable storage growth. Overall, Amulet enables efficient, secure on-device inference without loading the entire model into the TEE, offering a viable path for protecting IP in edge AI deployments.

Abstract

On-device machine learning (ML) introduces new security concerns about model privacy. Storing valuable trained ML models on user devices exposes them to potential extraction by adversaries. The current mainstream solution for on-device model protection is to store the weights and conduct inference within Trusted Execution Environments (TEEs). However, because the limited trusted memory cannot accommodate the whole model, most existing approaches employ a partitioning strategy, dividing a model into multiple slices that are loaded into the TEE sequentially. This frequent interaction between the untrusted and trusted worlds dramatically increases inference latency, sometimes by orders of magnitude. In this paper, we propose Amulet, a fast TEE-shielded on-device inference framework for ML model protection. Amulet incorporates a suite of obfuscation methods specifically designed for common neural network architectures. After obfuscation by the TEE, the entire transformed model can be securely stored in untrusted memory, allowing the inference process to execute directly in untrusted memory with GPU acceleration. For each inference request, only two rounds of minimal-overhead interaction between untrusted and trusted memory are required: one to process the input samples and one to process the output results. We also provide a theoretical proof, from an information-theoretic perspective, that the obfuscated model does not leak information about the original weights. We comprehensively evaluated Amulet using diverse model architectures ranging from ResNet-18 to GPT-2. Our approach incurs inference latency only 2.8-4.8x that of unprotected models with negligible accuracy loss, achieving an 8-9x speedup over baseline methods that execute inference entirely within TEEs and running approximately 2.2x faster than the state-of-the-art obfuscation-based method.
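
The two-interaction pattern described in the abstract can be illustrated with a toy example. The sketch below is not Amulet's construction: it assumes a single linear layer and a generic multiplicative blinding with random invertible matrices, it omits the additive masks and non-linear layers that the paper handles, and it makes no security claim. All names (`blinded_linear`, `random_invertible`, `P`, `Q`) are hypothetical and chosen for illustration. Exact fractions are used so that the unmasked result matches the plain product exactly.

```python
import random
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if singular."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_invertible(n, rng):
    """Sample small integer matrices until one is invertible (illustrative only)."""
    while True:
        m = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return m, inverse(m)
        except ValueError:
            continue


def blinded_linear(X, W, rng):
    """Compute X @ W while the 'untrusted' side only sees masked values."""
    d_in, d_out = len(W), len(W[0])

    # One-time setup, assumed to run on the trusted side.
    P, P_inv = random_invertible(d_in, rng)
    Q, Q_inv = random_invertible(d_out, rng)
    W_tilde = matmul(matmul(P_inv, W), Q)  # kept in untrusted memory

    # Interaction 1: the trusted side masks the input.
    X_tilde = matmul(X, P)

    # Untrusted side: the heavy product, equal to X @ W @ Q.
    Y_tilde = matmul(X_tilde, W_tilde)

    # Interaction 2: the trusted side removes the output mask.
    return matmul(Y_tilde, Q_inv)


X = [[4, -1], [2, 5]]          # two input rows
W = [[1, 2, 0], [0, 1, 3]]     # 2x3 weight matrix
assert blinded_linear(X, W, random.Random(0)) == matmul(X, W)
```

The point of the sketch is only the shape of the protocol: the expensive product runs outside the trusted side, and the trusted side is involved once on the way in and once on the way out, regardless of the matrix sizes.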

Paper Structure

This paper contains 30 sections, 3 theorems, 44 equations, 8 figures, 6 tables, 2 algorithms.

Key Result

Theorem 5.1

(Input Obfuscation) Let $W$ denote the weight matrix of the first layer. Assume that $P$, $Q$, and $S$ are independently and uniformly sampled invertible matrices, and that the random matrices $\{T_i\}_{i=1}^t$ are independently and uniformly selected for each round. Then, for any polynomial number where $\mathcal{O}_i = \{X_i, \tilde{X}_i, \tilde{T}_i, \tilde{W}\}$ with $\tilde{X}_i = P(X_i - T_

Figures (8)

  • Figure 1: Architectural Overview of $\mathtt{Amulet}$
  • Figure 2: Distribution of the masked parameters and the public parameters.
  • Figure 3: Statistical comparison of Amulet-obfuscated weights vs. original weights and of random matrices vs. original weights.
  • Figure 4: Inference latency across different models and methods. The results are normalized to the inference latency of the unprotected model (blue line). Since the Transformer model is too large to load entirely into the SGX enclave on our device, the corresponding results for MLCapsule are not included.
  • Figure 5: Layer-by-layer inference latency of AlexNet. The results are normalized to the latency of the unprotected model.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • proof : Proof (Sketch).
  • proof
  • proof
  • proof