Futureproof Static Memory Planning

Christos Lamprakos, Panagiotis Xanthopoulos, Manolis Katsaragakis, Sotirios Xydis, Dimitrios Soudris, Francky Catthoor

TL;DR

This work treats dynamic storage allocation as a static, NP-complete offset-assignment problem and introduces idealloc, a scalable allocator built on a corrected and extended Boxing Algorithm that handles million-buffer inputs. It combines theoretical refinements (latent invariants, critical-point handling) with a practical, parallelizable design. On a new, challenging benchmark suite it shows strong robustness and competitive efficiency against production solvers, producing low-fragmentation placements while keeping per-iteration latency low. The implementation is open source, and the paper includes a detailed discussion of future directions in large-scale memory planning. Together, these contributions offer a principled, scalable path toward future-proof static memory planning in deep learning and high-performance systems.

Abstract

The NP-complete combinatorial optimization task of assigning offsets to a set of buffers with known sizes and lifetimes so as to minimize total memory usage is called dynamic storage allocation (DSA). Existing DSA implementations bypass the theoretical state-of-the-art algorithms in favor of either fast but wasteful heuristics, or memory-efficient approaches that do not scale beyond one thousand buffers. The "AI memory wall", combined with deep neural networks' static architecture, has reignited interest in DSA. We present idealloc, a low-fragmentation, high-performance DSA implementation designed for million-buffer instances. Evaluated on a novel suite of particularly hard benchmarks from several domains, idealloc ranks first against four production implementations in terms of a joint effectiveness/robustness criterion.
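
To make the problem statement concrete, the sketch below encodes a tiny DSA instance and places it with a plain first-fit heuristic, the kind of fast but potentially wasteful baseline the abstract refers to. It is not idealloc's algorithm. The `Buffer` type, the half-open lifetime convention `[start, end)`, and the helper names `first_fit`, `peak_memory`, and `max_load` are assumptions made for this illustration only.

```python
from typing import Dict, List, NamedTuple


class Buffer(NamedTuple):
    name: str
    size: int   # bytes (or any unit) the buffer occupies
    start: int  # first time step at which the buffer is live (inclusive)
    end: int    # time step at which the buffer is freed (exclusive)


def lifetimes_overlap(a: Buffer, b: Buffer) -> bool:
    """Two buffers conflict in time if their half-open lifetimes intersect."""
    return a.start < b.end and b.start < a.end


def first_fit(buffers: List[Buffer]) -> Dict[str, int]:
    """Place buffers in the given order, each at the lowest offset that does
    not collide with an already-placed buffer that is live at the same time."""
    placed = []   # (buffer, offset) pairs decided so far
    offsets = {}
    for buf in buffers:
        # Address ranges occupied by temporally overlapping, already-placed buffers.
        busy = sorted(
            (off, off + other.size)
            for other, off in placed
            if lifetimes_overlap(buf, other)
        )
        offset = 0
        for lo, hi in busy:
            if offset + buf.size <= lo:
                break                 # the gap below this range is large enough
            offset = max(offset, hi)  # otherwise skip past the occupied range
        placed.append((buf, offset))
        offsets[buf.name] = offset
    return offsets


def peak_memory(buffers: List[Buffer], offsets: Dict[str, int]) -> int:
    """Total memory needed by a placement: the highest address any buffer reaches."""
    return max(offsets[b.name] + b.size for b in buffers)


def max_load(buffers: List[Buffer]) -> int:
    """Lower bound on any placement: the largest sum of live sizes at one time.
    The load can only increase at a start time, so checking those suffices."""
    return max(
        sum(o.size for o in buffers if o.start <= b.start < o.end)
        for b in buffers
    )


if __name__ == "__main__":
    instance = [
        Buffer("A", size=1, start=0, end=2),
        Buffer("B", size=1, start=1, end=4),
        Buffer("C", size=2, start=2, end=4),
    ]
    offsets = first_fit(instance)
    print(offsets)                         # {'A': 0, 'B': 1, 'C': 2}
    print(peak_memory(instance, offsets))  # 4
    print(max_load(instance))              # 3
```

In this instance first-fit needs 4 units although at most 3 units are ever live at once. Placing B before A yields a peak of 3, which shows how much the result of such a heuristic depends on processing order.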

Paper Structure

This paper contains 42 sections, 23 equations, 17 figures, 3 tables, 3 algorithms.

Figures (17)

  • Figure 1: A more detailed illustration of the dynamic storage allocation (DSA) problem. This instance comprises five buffers and a (suboptimal) solution, i.e., offset assignment to each of the buffers, is depicted.
  • Figure 2: Interval Graph Coloring.
  • Figure 3: First-fit placement.
  • Figure 4: An illustration of BA's main idea, that is, boxing buffers into Matryoshkas. The buffers on the left have $4!=24$ possible orderings. By boxing them into two distinct groups the number of possible orderings has been reduced by a factor of $3$. In their paper, Buchsbaum et al. do not care about this reduction in complexity; they use the boxes to reason about worst-case fragmentation.
  • Figure 5: T16 flow diagram.
  • ...and 12 more figures