Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study

Jianwei Zhu, Hang Yin, Peng Deng, Aline Almeida, Shunfan Zhou

TL;DR

Enabling TEE mode adds minimal computational overhead within the GPU. The performance penalty comes primarily from CPU-GPU data transfer, and it shrinks to nearly zero for larger models and longer sequences.

Abstract

This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA Hopper GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 7%, with larger models and longer sequences experiencing nearly zero overhead.

Paper Structure (14 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Throughput overhead across different token sizes (length of the input and output sequence). Short sequences are no longer than 100 tokens. Medium sequences are between 101 and 500 tokens. Long sequences are between 501 and 1500 tokens.
  • Figure 2: Throughput vs. output token size for Llama-3.1-8B on H100.
  • Figure 3: Throughput vs. output token size for Phi3-14B-128k on H100.
  • Figure 4: Throughput vs. output token size for Llama-3.1-70B on H100.
  • Figure 5: Throughput vs. output token size for Llama-3.1-8B on H200.
  • ...and 2 more figures
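
The sequence-length buckets in the Figure 1 caption and the notion of throughput overhead can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the paper. The sketch assumes that overhead is the relative drop in throughput when TEE mode is on, and that the "medium" bucket starts at 101 tokens; neither is stated explicitly in the text above.

```python
def length_bucket(num_tokens: int) -> str:
    """Classify a sequence by token count, following the Figure 1 caption.

    Assumption: "medium" covers 101-500 tokens (the caption only says
    "no longer than 500").
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if num_tokens <= 100:
        return "short"
    if num_tokens <= 500:
        return "medium"
    if num_tokens <= 1500:
        return "long"
    return "out_of_range"  # beyond the ranges named in the caption


def throughput_overhead_pct(baseline_tps: float, tee_tps: float) -> float:
    """Relative throughput drop, in percent, of TEE mode versus the baseline.

    Assumption: overhead = (1 - tee / baseline) * 100; the paper may use
    a different definition.
    """
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (1.0 - tee_tps / baseline_tps) * 100.0
```

With made-up throughputs of 100 tokens/s without TEE and 93 tokens/s with TEE, `throughput_overhead_pct(100.0, 93.0)` comes out at roughly 7%, the upper bound the abstract reports for typical queries.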