Horizon-LM: A RAM-Centric Architecture for LLM Training

Zhengqing Yuan; Lichao Sun; Yanfang Ye

TL;DR

Horizon-LM tackles memory bottlenecks in large-scale LLM optimization by making host memory the authoritative parameter store and repurposing GPUs as transient compute engines, enabling node-scale training on a single device. It introduces a CPU-master, GPU-template execution model with explicit block-wise recomputation and a pipelined, double-buffered streaming engine to keep GPU memory bounded by per-layer footprints while ensuring host memory scales predictably with model size. The approach is underpinned by formal invariants: $M_{\mathrm{GPU}} = O(P_{\max} + K A_{\max})$ and $M_{\mathrm{CPU}} \approx \sum_i P_i + M_{\mathrm{opt}} + O(P_{\max})$, which decouple capacity from device count and bound resource usage. Empirical evaluations across GH200/H200 and A100 platforms show Horizon-LM can train up to 120B parameters on a single GPU-host node with high throughput and numerical correctness, outperforming CPU-offloading baselines by up to 12.2×, and maintaining stable performance across depth and width scaling. This demonstrates that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model work, motivating a new streaming-centric design space for post-training workloads.
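As a rough illustration of the two invariants above, the following sketch evaluates them for a list of per-layer parameter sizes. The summary does not define K or A_max, so they are read here as the number of activation blocks kept resident and the largest per-block activation footprint; that reading, the function name `memory_bounds`, and the constants `c_gpu` and `c_cpu` standing in for the factors hidden in the O(.) terms are all assumptions, not values from the paper.

```python
def memory_bounds(layer_param_bytes, opt_state_bytes, k, act_max_bytes,
                  c_gpu=2, c_cpu=1):
    """Evaluate the two memory invariants for a given model layout.

    c_gpu and c_cpu are hypothetical constants for the O(P_max) terms
    (for example, c_gpu=2 for two streaming buffers); the source does
    not state them.
    """
    p_max = max(layer_param_bytes)
    # M_GPU = O(P_max + K * A_max): depends on the largest layer only.
    m_gpu = c_gpu * p_max + k * act_max_bytes
    # M_CPU ~ sum_i P_i + M_opt + O(P_max): grows with total model size.
    m_cpu = sum(layer_param_bytes) + opt_state_bytes + c_cpu * p_max
    return m_gpu, m_cpu


# Toy sizes in arbitrary units, chosen only to make the arithmetic visible.
shallow = memory_bounds([100, 300, 200], opt_state_bytes=1200, k=2, act_max_bytes=50)
deeper = memory_bounds([100, 300, 200, 300, 300], opt_state_bytes=2400, k=2, act_max_bytes=50)
```

Under these assumptions, adding layers no larger than the current widest layer leaves the GPU bound unchanged while the host bound grows with the parameter total, which is the decoupling the invariants describe.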

Abstract

The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
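The double-buffered streaming idea mentioned in the abstract can be sketched as a small CPU-only simulation: a loader thread stages the next layer from a host-side store while the consumer computes on the current one, and a semaphore caps the number of staged layers at two. This is not the paper's engine; `stream_layers`, the `compute` callback, and the use of a list copy in place of a host-to-device transfer are hypothetical stand-ins chosen for illustration.

```python
import queue
import threading


def stream_layers(host_layers, compute, num_buffers=2):
    """Run compute over layers while keeping at most num_buffers staged."""
    free_slots = threading.Semaphore(num_buffers)  # available device buffers
    ready = queue.Queue()                          # staged layers, in order
    lock = threading.Lock()
    resident = 0
    peak = 0

    def loader():
        nonlocal resident, peak
        for idx, weights in enumerate(host_layers):
            free_slots.acquire()       # block until a buffer is free
            staged = list(weights)     # stands in for a host-to-device copy
            with lock:
                resident += 1
                peak = max(peak, resident)
            ready.put((idx, staged))
        ready.put(None)                # sentinel: no more layers

    thread = threading.Thread(target=loader)
    thread.start()
    outputs = []
    while True:
        item = ready.get()
        if item is None:
            break
        idx, staged = item
        outputs.append(compute(idx, staged))
        with lock:
            resident -= 1
        free_slots.release()           # hand the buffer back to the loader
    thread.join()
    return outputs, peak


# Host-side "parameter store": six toy layers of four values each.
layers = [[i] * 4 for i in range(6)]
outputs, peak = stream_layers(layers, lambda idx, w: sum(w))
```

However many layers the host store holds, the peak number of staged layers never exceeds `num_buffers`, which mirrors the claim that GPU memory is bounded by per-layer footprints rather than by model depth.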

Paper Structure (34 sections, 9 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Background and Related Work
GPU-Centric Distributed Training Paradigm
Memory Extension via Offloading Frameworks
Design Challenges
Memory Requirement
Bandwidth and Streaming Requirement
Execution and Scheduling Requirement
Design Principles
Training Abstraction: CPU-Master, GPU-Cache
End-to-End Execution Workflow
System Architecture
CPU Domain: Authoritative Parameter Store
GPU Domain: Transient Execution Cache
Architectural Invariants
...and 19 more sections

Figures (8)

Figure 1: Sustained TFLOPS across model scales on a single GH200 (Qwen2.5 for 7B-32B) and H200 (Qwen2.5 72B and GPT-oss 120B). Horizon-LM remains efficient while offloading baselines become GPU memory-bound.
Figure 2: Horizon-LM architecture: the CPU acts as the parameter store while GPUs execute transient layer templates via asynchronous parameter streaming and gradient offloading.
Figure 3: End-to-end pipelined execution of Horizon-LM across compute, data movement, and CPU optimization.
Figure 4: Double-buffer streaming and slab-based gradient
Figure 5: Host (CPU) memory footprint versus model scale across training systems.
...and 3 more figures
