Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design

Haojie Ye, Yuchen Xia, Yuhan Chen, Kuan-Yu Chen, Yichao Yuan, Shuwen Deng, Baris Kasikci, Trevor Mudge, Nishil Talati

TL;DR

The key observation in Palermo is that classical ORAM protocols enforce restrictive dependencies between memory operations that result in low memory bandwidth utilization, and Palermo introduces a new protocol that overlaps large portions of memory operations, within a single and between multiple ORAM requests, without breaking correctness and security guarantees.

Abstract

Oblivious RAM (ORAM) hides memory access patterns, enhancing data privacy by preventing attackers from discovering sensitive information based on the sequence of memory accesses. The performance of ORAM is often limited by its inherent trade-off between security and efficiency, as concealing memory access patterns imposes significant computational and memory overhead. While prior works focus on improving ORAM performance by prefetching and eliminating ORAM requests, we find that their performance is very sensitive to workload locality and that they incur additional management overhead caused by ORAM stash pressure. This paper presents Palermo, a protocol-hardware co-design to improve ORAM performance. The key observation in Palermo is that classical ORAM protocols enforce restrictive dependencies between memory operations that result in low memory bandwidth utilization. Palermo introduces a new protocol that overlaps large portions of memory operations, both within a single ORAM request and across multiple ORAM requests, without breaking correctness and security guarantees. We then propose an ORAM controller architecture that executes the proposed protocol to service ORAM requests. The hardware is responsible for concurrently issuing memory requests as well as imposing the necessary dependencies to ensure a consistent view of the ORAM tree across requests. Using a rich workload mix, we demonstrate that Palermo outperforms the RingORAM baseline by 2.8x on average, incurring a negligible area overhead of 5.78 mm^2 (less than 2% of a 12th-generation Intel CPU after technology scaling) and a power overhead of 2.14 W, without sacrificing security. We further show that Palermo also outperforms the state-of-the-art works PageORAM, PrORAM, and IR-ORAM.

Paper Structure

This paper contains 32 sections, 1 equation, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: A toy ORAM access example for illustration purposes. The ORAM tree shown has Z and S set to 2. In practice, Z and S are much higher. On an LLC miss on the light blue block, the missed physical address is converted to a leaf number, which launches accesses along the corresponding path and pulls blocks into the stash. Once any node has been touched S times, a reset routine is launched.
  • Figure 2: Hierarchical ORAM memory spaces. Because the secret data structure PosMap exceeds the on-chip memory capacity, a second-level ORAM protocol is launched to protect accesses to PosMap. The recursive process continues until the PosMap of the protected data structure can be stored on-chip.
  • Figure 3: RingORAM protocol bandwidth utilization and performance breakdown. RingORAM achieves less than 30% bandwidth utilization, which is similar across workloads due to the application of the ORAM protocol. ORAM-sync overhead accounts for 72.4% of the execution time, which indicates that the memory stays idle most of the time, waiting for long-latency pull requests to be serviced.
  • Figure 4: Normalized speedup of PrORAM and LAORAM (PrORAM w/ Fat Tree) running stm, a synthetic workload in which consecutive cache line addresses miss in the LLC one after another. pf=X refers to forcing a mapping to the same leaf for a prefetch length of X. A high dummy request ratio limits performance scaling despite the locality present.
  • Figure 5: Intra-request parallelism in serving a single ORAM request.
  • ...and 10 more figures
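
The access flow described in the Figure 1 caption (a leaf number selects a root-to-leaf path, every node on that path is touched, and a node that has been touched S times is reset) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the heap-style node numbering, the names `path_nodes` and `ToyTree`, and the simple per-node touch counter are all assumptions made for this example. The stash, the PosMap lookup that produces the leaf number, and the actual reset routine are omitted.

```python
from collections import defaultdict


def path_nodes(leaf: int, depth: int) -> list[int]:
    """Return node indices from the root (index 0) down to the given leaf.

    Assumes a complete binary tree stored heap-style, with `depth` levels
    below the root and leaves numbered 0 .. 2**depth - 1 (hypothetical layout).
    """
    node = (1 << depth) - 1 + leaf  # heap index of the leaf node
    path = [node]
    while node > 0:
        node = (node - 1) // 2      # step up to the parent
        path.append(node)
    return path[::-1]               # reorder as root -> leaf


class ToyTree:
    """Tracks how often each node is touched; stands in for the ORAM tree."""

    def __init__(self, depth: int, s: int):
        self.depth = depth
        self.s = s                        # touches allowed before a reset
        self.touches = defaultdict(int)   # node index -> touches since last reset

    def access(self, leaf: int) -> list[int]:
        """Touch every node on the path to `leaf`.

        Returns the nodes whose counter reached S and was reset (a stand-in
        for launching the reset routine on those nodes).
        """
        reset = []
        for node in path_nodes(leaf, self.depth):
            self.touches[node] += 1
            if self.touches[node] >= self.s:
                self.touches[node] = 0
                reset.append(node)
        return reset


if __name__ == "__main__":
    tree = ToyTree(depth=2, s=2)   # S = 2 as in the toy example
    for leaf in (0, 3, 0):
        print(leaf, path_nodes(leaf, 2), tree.access(leaf))
```

With S = 2, the root is on every path and is therefore the first node to be reset, which shows why nodes near the root trigger the reset routine most often in this toy model.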