
Old is Gold: Optimizing Single-threaded Applications with Exgen-Malloc

Ruihao Li, Lizy K. John, Neeraja J. Yadwadkar

TL;DR

This work argues that single-threaded applications needlessly pay the cost of modern multi-threaded allocators. It introduces Exgen-Malloc, a specialized single-threaded memory allocator with a centralized heap, a single per-page free-list, and a balanced memory commitment model, while incorporating selective optimizations from contemporary multi-threaded allocators. Empirical results across SPEC CPU2017, redis-benchmark, and mimalloc-bench on two Intel Xeon platforms show Exgen-Malloc achieving up to a 1.93x speedup over dlmalloc and up to 25.2% memory savings over mimalloc, with hardware-counter analysis attributing the gains to reduced cache misses and TLB activity. The findings suggest that tailoring allocators to single-threaded workloads can yield substantial performance and memory-efficiency benefits for datacenter and edge environments.

Abstract

Memory allocators hide beneath nearly every application stack, yet their performance footprint extends far beyond their code size. Even small allocator inefficiencies ripple through caches and the rest of the memory hierarchy, collectively imposing what operators often call a "datacenter tax". At hyperscale, even a 1% improvement in allocator efficiency can unlock millions of dollars in savings and measurable reductions in datacenter energy consumption. Modern memory allocators are designed to optimize allocation speed and memory fragmentation in multi-threaded environments, relying on complex metadata and control logic to achieve high performance. However, the overhead introduced by this complexity prompts a reevaluation of allocator design. Notably, such overhead can be avoided in single-threaded scenarios, which continue to be widely used across diverse application domains. In this paper, we introduce Exgen-Malloc, a memory allocator purpose-built for single-threaded applications. By specializing for single-threaded execution, Exgen-Malloc eliminates unnecessary metadata and simplifies the control flow, thereby reducing overhead and improving allocation efficiency. Its core design features include a centralized heap, a single free-block list, and a balanced strategy for memory commitment and relocation. Additionally, Exgen-Malloc incorporates design principles from modern multi-threaded allocators that are absent from legacy single-threaded allocators such as dlmalloc. We evaluate Exgen-Malloc on two Intel Xeon platforms. Across both systems, Exgen-Malloc achieves speedups of 1.17x, 1.10x, and 1.93x over dlmalloc on SPEC CPU2017, redis-benchmark, and mimalloc-bench, respectively. In addition to performance, Exgen-Malloc achieves 6.2%, 0.1%, and 25.2% memory savings over mimalloc on SPEC CPU2017, redis-benchmark, and mimalloc-bench, respectively.
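To make the "single free-block list" idea concrete, the following is a minimal sketch in C of a page that serves fixed-size blocks from one intrusive free list, with no locks or atomic operations. It is an illustration under stated assumptions, not Exgen-Malloc's actual code: the names `page_t`, `block_t`, `page_init`, `page_alloc`, and `page_free`, as well as the 4096-byte `PAGE_SIZE`, are hypothetical and do not come from the paper.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u  /* hypothetical page size chosen for this sketch */

/* A free block stores the link to the next free block in its own memory. */
typedef struct block { struct block *next; } block_t;

/* Hypothetical per-page metadata: one free list, no thread-related fields. */
typedef struct page {
    block_t *free;        /* the single free list for this page */
    size_t   block_size;  /* size of every block in this page */
    size_t   used;        /* number of blocks currently handed out */
    _Alignas(max_align_t) unsigned char data[PAGE_SIZE];
} page_t;

/* Carve the page into equal blocks and thread them onto the free list. */
static void page_init(page_t *p, size_t block_size) {
    size_t align = _Alignof(max_align_t);
    if (block_size < sizeof(block_t)) block_size = sizeof(block_t);
    block_size = (block_size + align - 1) / align * align;  /* round up */
    p->block_size = block_size;
    p->used = 0;
    p->free = NULL;
    /* Push blocks in reverse so the lowest address ends up at the head. */
    for (size_t i = PAGE_SIZE / block_size; i-- > 0; ) {
        block_t *b = (block_t *)(p->data + i * block_size);
        b->next = p->free;
        p->free = b;
    }
}

/* Pop the head of the free list; returns NULL when the page is full. */
static void *page_alloc(page_t *p) {
    block_t *b = p->free;
    if (b == NULL) return NULL;
    p->free = b->next;
    p->used++;
    return b;
}

/* Push a block back onto the free list (plain stores, single-threaded only). */
static void page_free(page_t *p, void *ptr) {
    block_t *b = (block_t *)ptr;
    b->next = p->free;
    p->free = b;
    p->used--;
}
```

Because only one thread ever touches the page, both the allocation and the free path reduce to a few loads and stores on one list head. A multi-threaded design would typically need extra per-page state to accept frees arriving from other threads, which is the kind of metadata and control logic the abstract says can be dropped.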

Paper Structure

This paper contains 29 sections, 25 figures, 1 table.

Figures (25)

  • Figure 1: Timeline of memory allocators (for C/C++). Since LKMalloc, efforts have primarily focused on multi-threaded allocators, leaving single-threaded allocators largely overlooked.
  • Figure 2: Performance and memory efficiency comparison of Exgen-Malloc against the legacy single-threaded allocator (dlmalloc), the default glibc allocator, and modern multi-threaded allocators (jemalloc, tcmalloc, and mimalloc). Exgen-Malloc achieves higher speedups and lower memory consumption than state-of-the-art multi-threaded allocators.
  • Figure 3: Multi-threaded Allocator vs Single-threaded Allocator. The single-threaded allocator uses single-layer metadata and simplifies the control logic.
  • Figure 4: Scalability of different memory allocators for xalancbmk. As the number of copies increases, the IPC decreases due to higher overhead caused by an increase in L1-dcache MPKI (misses per thousand instructions).
  • Figure 5: Metadata layout of Exgen-Malloc. Exgen-Malloc employs a central heap composed of multiple segments, each of which is further divided into pages.
  • ...and 20 more figures