Ascend-CC: Confidential Computing on Heterogeneous NPU for Emerging Generative AI Workloads

Aritra Dhar, Clément Thorens, Lara Magdalena Lazier, Lukas Cavigelli

TL;DR

Ascend-CC introduces a confidential computing architecture for discrete NPUs that operates without a CPU TEE, protecting data, model parameters, and operator binaries from an untrusted host. It relies on memory lock invariants, AES-GCM-based in-device encryption, and model/task attestation anchored by a hardware root of trust and measured boot, enabling end-to-end confidentiality for LLM workloads. Implemented on the Huawei Ascend 910A, Ascend-CC demonstrates minimal overhead in LLM inference for large models (e.g., Llama2/Llama3) with no changes to existing AI software stacks, validating its practicality for cloud GenAI scenarios. The approach generalizes to other task-based accelerators and offers a scalable path to confidential computing across multi-party AI deployments.

Abstract

Cloud workloads have dominated generative AI based on large language models (LLMs). Specialized hardware accelerators, such as GPUs, NPUs, and TPUs, play a key role in AI adoption due to their superior performance over general-purpose CPUs. The AI models and the data are often highly sensitive and come from mutually distrusting parties. Existing CPU-based TEEs such as Intel SGX or AMD SEV do not provide sufficient protection. Device-centric TEEs like Nvidia-CC only address tightly coupled CPU-GPU systems, using a proprietary solution that requires a TEE on the host CPU side. Existing academic proposals, on the other hand, are tailored toward specific CPU-TEE platforms. To address this gap, we propose Ascend-CC, a confidential computing architecture based on discrete NPU devices that requires no trust in the host system. Ascend-CC provides strong security by encrypting both data and models, protecting not only the data but also the model parameters and operator binaries. Ascend-CC uses delegation-based memory semantics to ensure isolation from the host software stack, and task attestation provides strong model integrity guarantees. Our implementation and evaluation of Ascend-CC with state-of-the-art LLMs such as Llama2 and Llama3 show that it introduces minimal overhead and requires no changes to the AI software stack.

Paper Structure

This paper contains 20 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: High-level architecture of the Ascend 910A SoC, along with the virtual memory shared with a 64-bit host CPU.
  • Figure 2: Memory footprint of Llama-3-8B and Llama-2-13B on the Ascend 910A NPU with 32GB HBM.
  • Figure 3: Example code for matrix multiplication on the Ascend NPU.
  • Figure 4: An example matrix multiplication task and memory layout on the NPU, corresponding to the code snippet in Figure 3.
  • Figure 5: Parallel cryptographic operation on model and data to hide the latency introduced by the AES-GCM operator running on the AI-CPU cores. The AI core executes the AI-related operations, such as the layer computation during an inference pass.
  • ...and 9 more figures
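
The latency-hiding idea in the Figure 5 caption (cryptographic work on one unit overlapping with layer computation on another) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: `keystream_xor` is a toy, insecure stand-in for the AES-GCM operator, `compute_layer` is a placeholder for the AI-core work, and `KEY`, `run_pipelined`, and the per-layer chunking are all hypothetical names and assumptions introduced here. A single worker thread plays the role of the AI-CPU cores.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

KEY = b"demo-key"  # hypothetical key, for illustration only


def keystream_xor(chunk: bytes, index: int) -> bytes:
    """Toy stand-in for the AES-GCM operator (NOT secure, no authentication).

    XORs the chunk with a SHA-256-derived keystream; applying it twice with
    the same index returns the original bytes.
    """
    stream = b""
    counter = 0
    while len(stream) < len(chunk):
        block_input = KEY + index.to_bytes(4, "big") + counter.to_bytes(4, "big")
        stream += hashlib.sha256(block_input).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, stream))


def compute_layer(state: int, weights: bytes) -> int:
    """Placeholder for the layer computation that would run on the AI core."""
    return (state + sum(weights)) % 1_000_003


def run_pipelined(encrypted_layers: list[bytes], state: int = 0) -> int:
    """Decrypt layer i+1 on a worker thread while layer i is being computed."""
    if not encrypted_layers:
        return state
    with ThreadPoolExecutor(max_workers=1) as crypto_worker:
        # Prime the pipeline with the first layer's decryption.
        pending = crypto_worker.submit(keystream_xor, encrypted_layers[0], 0)
        for i in range(len(encrypted_layers)):
            weights = pending.result()  # wait for layer i to be decrypted
            if i + 1 < len(encrypted_layers):
                # Start decrypting the next layer before computing this one.
                pending = crypto_worker.submit(
                    keystream_xor, encrypted_layers[i + 1], i + 1
                )
            state = compute_layer(state, weights)
    return state


if __name__ == "__main__":
    plain = [bytes([i]) * 64 for i in range(1, 5)]
    encrypted = [keystream_xor(p, i) for i, p in enumerate(plain)]
    print(run_pipelined(encrypted))
```

In CPython the two stages will not truly run in parallel for CPU-bound work because of the global interpreter lock, so the sketch shows only the scheduling structure: as long as decrypting the next chunk takes no longer than computing the current one, the cryptographic latency is hidden behind compute on hardware where the two stages run on separate units.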