Nofl: A Precise Immix

Andy Wingo

TL;DR

This work addresses memory management design by pursuing precise reclamation even at fine granularity, extending Immix with a side-table metadata approach to reclaim memory down to the allocator’s minimum alignment. It introduces Nofl, a precise Immix variant, and the Whippet library with a mostly-marking Nofl collector (mmc), plus the Whiffle Scheme-to-C workbench for evaluation. Across microbenchmarks, mmc generally yields lower wall-clock overhead than standard copying and mark-sweep collectors on tight heaps, though some benchmarks (e.g., earley) remain challenging, highlighting the trade-offs of finer-grained reclamation. The results suggest that precise, per-granule reclamation is feasible and can improve fragmentation behavior and overall performance in embedded runtimes, motivating broader production evaluations and future optimization directions.

Abstract

Can a memory manager be built with fast bump-pointer allocation, single-pass heap tracing, and a low upper bound on memory overhead? The Immix collector answered in the affirmative for the first two, but the granularity at which it reclaims memory means that in the worst case a tiny object can keep two 128-byte lines of memory from being re-used for allocation. This paper takes Immix to an extreme of precision, allowing all free space between objects to be reclaimed, down to the limit of the allocator's minimum alignment. We present the design of this Nofl layout, build a collector library around it, and build a new Scheme-to-C compiler as a workbench. We make a first evaluation of the Nofl-based mostly-marking collector when compared to standard copying and mark-sweep collectors and run against a limited set of microbenchmarks, finding that Nofl outperforms the others for tight-to-adequate heap sizes.

Paper Structure

This paper contains 26 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Lazy sweeping over the mark array of a Nofl block: (1) Initially, the sweep pointer s and the allocation pointer a are at the beginning of the block's mark table. (2) The allocator begins looking for a hole by scanning for a byte with the current mark (M2), or end-of-block. It finds the end-of-hole at offset 2, so it advances s to 2 and clears the mark table between a and s. (3) We allocate a single-word object, so we write the young and end-of-object markers at a, then advance a by one. (4) We go to allocate a two-word object; the hole is too small, as s - a is 1. Sweeping advances s over live objects marked with M2 by scanning for end-of-block, repeats as long as s points to a live object, sets a to s to start the hole, scans forward for end-of-hole as in (2) and allocates the object as in (3).
  • Figure 2: Wall-clock time overheads imposed by the different collectors for benchmarks run with a single mutator thread. The vertical axes show lower-bound overheads (LBO): a ratio of total time divided by the minimum observed time to complete the benchmark, not counting GC pauses.
  • Figure 3: Wall-clock time overheads as in Figure 2, but with eight mutator threads and heap sizes scaled up by eight.
  • Figure 4: CPU time overheads, single mutator thread. Compare to Figure 2, which measures wall-clock time instead.
  • Figure 5: CPU time overheads, eight mutator threads. Compare to Figure 3, which measures wall-clock time instead.
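
The lazy-sweeping steps in the Figure 1 caption can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not Whippet's actual code: the bit values, the 16-granule block size, and the names `struct block`, `next_hole`, and `allocate` are all hypothetical, and the sketch assumes that stepping past a live object means scanning forward to its end-of-object marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata encoding; the bit values are assumptions. */
enum {
  MARK_OLD = 0x01, /* stale mark from an earlier cycle: object is dead */
  MARK_CUR = 0x02, /* "M2": object survived the most recent collection */
  YOUNG    = 0x08, /* object allocated since the last collection */
  END      = 0x10  /* last granule of an object */
};

#define BLOCK_GRANULES 16 /* assumed block size, in granules */

struct block {
  uint8_t marks[BLOCK_GRANULES]; /* one metadata byte per granule */
  size_t a;                      /* allocation pointer (granule index) */
  size_t s;                      /* sweep pointer (granule index) */
};

/* Skip any live objects at s, open the next hole [a, s), and clear its
   stale metadata. Returns the hole size in granules; 0 means the block
   is exhausted. */
static size_t next_hole(struct block *b) {
  size_t s = b->s;
  /* Advance over consecutive live objects: each one starts with
     MARK_CUR and extends to its END marker. */
  while (s < BLOCK_GRANULES && (b->marks[s] & MARK_CUR)) {
    while (s + 1 < BLOCK_GRANULES && !(b->marks[s] & END)) s++;
    s++;
  }
  b->a = s; /* the hole starts here */
  /* Scan forward for end-of-hole: the next current mark, or end-of-block. */
  while (s < BLOCK_GRANULES && !(b->marks[s] & MARK_CUR)) s++;
  b->s = s;
  memset(b->marks + b->a, 0, b->s - b->a); /* drop dead objects' metadata */
  return b->s - b->a;
}

/* Bump-allocate `granules` granules. Returns the granule index of the
   new object, or -1 if no remaining hole in the block is large enough. */
static ptrdiff_t allocate(struct block *b, size_t granules) {
  assert(granules > 0);
  while (b->s - b->a < granules) {
    if (next_hole(b) == 0) return -1; /* reached end-of-block */
  }
  size_t obj = b->a;
  b->marks[obj] |= YOUNG;              /* young marker on the first granule */
  b->marks[obj + granules - 1] |= END; /* end-of-object on the last granule */
  b->a += granules;
  return (ptrdiff_t)obj;
}
```

With two dead granules at offsets 0 and 1 followed by live objects, a one-granule allocation lands at offset 0 and leaves a hole of size 1. A following two-granule allocation abandons that hole, sweeps past the live objects, and allocates from the next hole. A real collector would presumably chain on to the next block rather than return -1.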