DuckDB-SGX2: The Good, The Bad and The Ugly within Confidential Analytical Query Processing

Ilaria Battiston, Lotte Felius, Sam Ansmink, Laurens Kuiper, Peter Boncz

TL;DR

This work evaluates end-to-end confidential analytical query processing by integrating DuckDB with Parquet Modular Encryption on Intel SGX2 via Gramine. It demonstrates a secure pipeline for TPC-H SF30 and shows that with careful configuration (using AES acceleration and an SGX-aware memory allocator, and respecting enclave locality), the overhead relative to plaintext execution is typically around 1.5x to 2x, though poorly optimized setups can reach up to 16x. The study identifies key bottlenecks such as EPC paging, cache misses, and NUMA locality, and provides concrete tuning guidelines to mitigate them. It also discusses security considerations, acknowledging remaining gaps (e.g., metadata protection and access-pattern hiding) and outlining a roadmap for advancing confidential analytics across diverse TEEs and architectures.

Abstract

We provide an evaluation of an analytical workload in a confidential computing environment, combining DuckDB with two technologies: modular columnar encryption in Parquet files (data at rest) and the newest version of the Intel SGX Trusted Execution Environment (TEE), providing a hardware enclave where data in flight can be (more) securely decrypted and processed. One finding is that the "performance tax" for such confidential analytical processing is acceptable compared to not using these technologies. We eventually manage to run TPC-H SF30 with under 2x overhead compared to non-encrypted, non-enclave execution; we show that, specifically, columnar compression and encryption are a good combination. Our second finding consists of dos and don'ts to tune DuckDB to work effectively in this environment. There are various performance hazards: potentially 5x higher cache miss costs due to memory encryption inside the enclave, NUMA penalties, and highly elevated cost of swapping pages in and out of the enclave -- which is also triggered indirectly by using a non-SGX-aware malloc library.

Paper Structure (4 sections, 2 figures)

Figures (2)

  • Figure 1: DuckDB TPC-H 30GB power scores for various configurations (compression, encryption, SGX). The "good" is light purple vs. red (affordable confidentiality); the "bad" is light orange and yellow vs. light purple (SGX sensitivity to configuration); and the "ugly" is the choice between blue and light purple: how much performance is more security worth?
  • Figure 2: Relative score of TPC-H SF30 (average of 5 runs, relative to the encrypted Parquet mbedtls baseline; lower is better). The overhead rises to 16x (yellow) when configurations are not carefully optimized; in the best-case scenario (light purple), however, each query suffers at most 2x overhead, a tradeoff we consider acceptable in order to protect our data. The overhead of SGX also varies significantly across queries, which we attribute to the higher cache miss cost.
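
The relative score described in the Figure 2 caption (mean over repeated runs, divided by the baseline's mean, lower is better) can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: the function name `relative_scores`, the dictionary layout, and all timings are assumptions invented for the example.

```python
from statistics import mean


def relative_scores(config_runs, baseline_runs):
    """Return the per-query ratio of mean runtime to the baseline's mean runtime.

    Both arguments map a query name to a list of runtimes in seconds,
    one entry per run. A ratio of 1.0 means parity with the baseline;
    a higher ratio means the configuration is slower.
    """
    ratios = {}
    for query, runs in config_runs.items():
        baseline = mean(baseline_runs[query])
        ratios[query] = mean(runs) / baseline
    return ratios


# Placeholder timings, invented for illustration only
# (these are NOT measurements from the paper).
baseline = {"Q1": [2.0, 2.0, 2.0, 2.0, 2.0], "Q6": [1.0, 1.0, 1.0, 1.0, 1.0]}
enclave = {"Q1": [3.0, 3.0, 3.0, 3.0, 3.0], "Q6": [2.0, 2.0, 2.0, 2.0, 2.0]}

scores = relative_scores(enclave, baseline)
print(scores)
```

Reporting the ratio per query, rather than a single aggregate, matches the caption's observation that the overhead varies significantly across queries.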