Optimized Memory Tagging on AmpereOne Processors
Shiv Kaushik, Mahesh Madhav, Nagi Aboulenein, Jason Bessette, Sandeep Brahmadathan, Ben Chaffin, Matthew Erler, Stephan Jourdan, Thomas Maciukenas, Ramya Masti, Jon Perry, Massimo Sutera, Scott Tetrick, Bret Toll, David Turley, Carl Worth, Atiq Bajwa
TL;DR
The paper tackles memory-safety escapes in pointer-based software by evaluating the ARM Memory Tagging Extension (MTE) in a datacenter context. It presents a co-designed AmpereOne MTE implementation that co-locates tags with data using ECC bits, enabling SYNC-mode checks with no memory-capacity overhead for tag storage and single-digit performance impact. Through hardware-software co-design and extensive benchmarking (including SPEC CPU 2017 and cloud workloads), it identifies allocator behavior and memory-management patterns as the key remaining sources of overhead and shows that the performance impact stays in the single digits with tagging enabled. The work also outlines future hardware and software optimizations, including allocator improvements and enhanced metadata support in memory subsystems, to further reduce tagging overhead in production cloud environments.
Abstract
Memory-safety escapes continue to form the launching pad for a wide range of security attacks, especially for the substantial base of deployed software that is coded in pointer-based languages such as C/C++. Although compiler and Instruction Set Architecture (ISA) extensions have been introduced to address elements of this issue, their overhead and/or limited applicability have prevented broad production deployment. The Memory Tagging Extension (MTE) to the ARM AArch64 Instruction Set Architecture is a valuable tool to address memory-safety escapes; when used in synchronous tag-checking mode, MTE provides deterministic detection and prevention of sequential buffer overflow attacks, and probabilistic detection and prevention of exploits resulting from temporal use-after-free pointer programming bugs. The AmpereOne processor, launched in 2024, is the first datacenter processor to support MTE. Its optimized MTE implementation uniquely incurs no memory-capacity overhead for tag storage and provides synchronous tag-checking with single-digit performance impact across a broad range of datacenter-class workloads. Furthermore, this paper analyzes the complete hardware-software stack, identifying application memory management as the primary remaining source of overhead and highlighting clear opportunities for software optimization. The combination of an efficient hardware foundation and a clear path for software improvement makes the MTE implementation of the AmpereOne processor highly attractive for deployment in production cloud environments.
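The distinction the abstract draws between deterministic detection of sequential overflows and probabilistic detection of use-after-free can be illustrated with a small software model. The sketch below is not the AmpereOne hardware or any real allocator: the class ToyTaggedHeap, its bump-allocation policy, and the tag-selection rules (a fresh allocation never reuses its left neighbor's tag; freed memory is retagged uniformly at random) are assumptions chosen for illustration. Only the 4-bit tag width and the 16-byte tag granule are taken from the MTE architecture.

```python
import random

GRANULE = 16    # bytes covered by one memory tag (MTE granule size)
NUM_TAGS = 16   # MTE tags are 4 bits wide


class TagFault(Exception):
    """Raised when a pointer's tag differs from the tag of the memory it touches."""


class ToyTaggedHeap:
    """Hypothetical bump allocator over an array of per-granule tags."""

    def __init__(self, size_bytes, seed=0):
        self.mem_tags = [0] * (size_bytes // GRANULE)  # tag 0 = never allocated
        self.next_granule = 0
        self.rng = random.Random(seed)

    def malloc(self, nbytes):
        n = -(-nbytes // GRANULE)  # round up to whole granules
        start = self.next_granule
        if start + n > len(self.mem_tags):
            raise MemoryError("toy heap exhausted")
        # Assumed policy: pick a non-zero tag that differs from the left
        # neighbor's tag, so a sequential overflow always mismatches.
        left = self.mem_tags[start - 1] if start > 0 else 0
        tag = self.rng.choice([t for t in range(1, NUM_TAGS) if t != left])
        self.mem_tags[start:start + n] = [tag] * n
        self.next_granule = start + n
        return (tag, start * GRANULE)  # a "pointer" is (tag, address)

    def free(self, ptr, nbytes):
        # Assumed policy: retag freed granules uniformly at random, so a
        # stale pointer still matches with probability 1/16.
        _, addr = ptr
        first = addr // GRANULE
        new_tag = self.rng.randrange(NUM_TAGS)
        for g in range(first, first + -(-nbytes // GRANULE)):
            self.mem_tags[g] = new_tag

    def load(self, ptr, offset):
        # Synchronous-style check: fault at the access itself on a mismatch.
        tag, addr = ptr
        if self.mem_tags[(addr + offset) // GRANULE] != tag:
            raise TagFault(f"tag mismatch at address {addr + offset}")


def overflow_detected(seed):
    """Read one byte past buffer a into adjacent buffer b; True if it faults."""
    heap = ToyTaggedHeap(4096, seed)
    a = heap.malloc(32)
    heap.malloc(32)  # b, placed directly after a
    heap.load(a, 31)  # last in-bounds byte: no fault
    try:
        heap.load(a, 32)
    except TagFault:
        return True
    return False


def uaf_escape_rate(trials, seed):
    """Fraction of use-after-free reads that go undetected in this model."""
    heap = ToyTaggedHeap(trials * GRANULE, seed)
    escapes = 0
    for _ in range(trials):
        p = heap.malloc(GRANULE)
        heap.free(p, GRANULE)
        try:
            heap.load(p, 0)
            escapes += 1
        except TagFault:
            pass
    return escapes / trials
```

Under these assumptions, overflow_detected returns True for every seed, because adjacent allocations are forced to carry different tags, while uaf_escape_rate settles near 1/16, because a stale pointer can only pass the check when the random retag happens to collide with its old tag. The model also shows the granularity limit of tagging: an allocation is rounded up to whole 16-byte granules, so an overflow that stays within the final granule would not be caught by the tag check.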
