ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators

Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han

TL;DR

ODMA targets inefficient KV-cache memory management for LLM serving on LPDDR5-based, random-access-constrained memory (RACM) accelerators. It couples a generation-length predictor, dynamic bucket boundaries learned from live traces, and a large-bucket safety mechanism to balance accuracy, fragmentation, and robustness. The design preserves a contiguous memory layout suited to LPDDR-class devices while adapting to workload drift; the resulting gains in prediction accuracy and device-memory utilization translate into higher end-to-end throughput. On Cambricon MLU370-X4 with 7B-class models, ODMA improves requests per second (RPS) by ~29% and tokens per second (TPS) by ~27% over static worst-case pre-allocation, and raises memory utilization from 55.05% to 72.45%. These results indicate that hardware-aware, predictor-driven allocation can enable efficient LLM serving on RACM platforms without kernel changes.

Abstract

Serving large language models (LLMs) on accelerators with poor random-access bandwidth (e.g., LPDDR5-based devices) is limited by current memory managers. Static pre-allocation wastes memory, while fine-grained paging (e.g., PagedAttention) is ill-suited to these devices because of its high random-access cost. Existing HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators such as the Cambricon MLU370. We present ODMA, an on-demand memory allocation framework for RACM. ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard. Bucket boundaries are periodically updated from live traces to maximize utilization. On Alpaca and Google-NQ, ODMA improves prediction accuracy over prior work (e.g., from 82.68% to 93.36%). Serving DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, ODMA raises memory utilization from 55.05% to 72.45% and improves RPS and TPS by 29% and 27%, respectively, over static baselines. This demonstrates that hardware-aware allocation enables efficient LLM serving on RACM platforms.
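
To make the idea of trace-driven bucket boundaries concrete, here is a minimal Python sketch. It is not the authors' implementation: the equal-frequency quantile rule, the function names (`update_boundaries`, `bucket_for`, `utilization`), and the parameters `num_buckets` and `max_len` are all assumptions chosen for illustration. The sketch recomputes bucket capacities from a window of observed generation lengths, always keeps one worst-case bucket as a safeguard, and compares the resulting reservation against static worst-case pre-allocation.

```python
from bisect import bisect_left
from typing import List, Sequence


def update_boundaries(observed_lengths: Sequence[int],
                      num_buckets: int,
                      max_len: int) -> List[int]:
    """Recompute bucket capacities from a window of observed lengths.

    Assumption: equal-frequency quantiles; the source does not state the
    exact rule. max_len is always kept as the final (large) bucket.
    """
    if not observed_lengths or num_buckets < 2:
        return [max_len]
    ordered = sorted(observed_lengths)
    n = len(ordered)
    bounds: List[int] = []
    for k in range(1, num_buckets):
        idx = min(n - 1, (k * n) // num_buckets)
        b = min(ordered[idx], max_len)
        if not bounds or b > bounds[-1]:  # keep boundaries strictly increasing
            bounds.append(b)
    if bounds[-1] < max_len:
        bounds.append(max_len)  # large-bucket safeguard
    return bounds


def bucket_for(length: int, bounds: Sequence[int]) -> int:
    """Smallest bucket capacity that fits `length`; overflow goes to the largest."""
    i = bisect_left(bounds, length)
    return bounds[min(i, len(bounds) - 1)]


def utilization(lengths: Sequence[int], bounds: Sequence[int]) -> float:
    """Used tokens divided by reserved tokens, assuming lengths are known exactly."""
    used = sum(min(length, bounds[-1]) for length in lengths)
    reserved = sum(bucket_for(length, bounds) for length in lengths)
    return used / reserved


# Hypothetical trace of generation lengths (tokens), with one heavy-tail request.
trace = [10, 20, 30, 40, 50, 60, 70, 800]
bounds = update_boundaries(trace, num_buckets=4, max_len=1024)
print(bounds)
print(utilization(trace, bounds), utilization(trace, [1024]))
```

The utilization figure here uses the true lengths as if the predictor were perfect, so it is an upper bound for this toy trace rather than a reproduction of the reported 55.05% to 72.45% result.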

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: ODMA overview. User prompts are annotated by the Predictor and inserted into a Task Pool. The Scheduler groups tagged tasks into batches and sends them to the Runtime, which interacts with the Allocator. The Allocator (with Cluster Manager and per-device Memory Pools) allocates bucket-tagged contiguous blocks on LPDDR-class accelerators.
  • Figure 2: Predictor pipeline. Prompt features and metadata are fed into a lightweight encoder to produce a length estimate $\hat{L}$ and uncertainty $u$. These outputs determine the bucket tag attached to each request.
  • Figure 3: Prediction accuracy: ODMA vs. S3. Left: Alpaca; Right: Google-NQ.
  • Figure 4: Throughput improvement of ODMA over a static pre-allocation baseline (Cambricon-vLLM). Left: RPS; Right: TPS.
  • Figure 5: Device-memory utilization with ODMA. Left: Alpaca; Right: Google-NQ.
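
The Figure 2 caption says that the length estimate $\hat{L}$ and uncertainty $u$ determine each request's bucket tag, but this summary does not give the mapping. One plausible way to combine them is sketched below in Python. Everything specific is an assumption for illustration: the `Prediction` type, the `bucket_tag` function, the padding rule $\hat{L}(1 + \text{margin} \cdot u)$, the threshold `u_max`, and the convention that $u$ lies in $[0, 1]$.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Prediction:
    length: float       # predicted generation length, i.e. L-hat
    uncertainty: float  # u, assumed here to lie in [0, 1]


def bucket_tag(pred: Prediction,
               bounds: Sequence[int],
               margin: float = 0.5,
               u_max: float = 0.8) -> int:
    """Return the index of the bucket a request is tagged with.

    Hypothetical rule: pad the estimate in proportion to its uncertainty,
    and send very uncertain or oversized requests to the largest bucket.
    """
    large = len(bounds) - 1
    if pred.uncertainty >= u_max:
        return large  # too uncertain: fall back to the large bucket
    padded = pred.length * (1.0 + margin * pred.uncertainty)
    return min(bisect_left(bounds, padded), large)


# Hypothetical bucket capacities in tokens, sorted ascending.
bounds = [64, 256, 1024]
print(bucket_tag(Prediction(50, 0.1), bounds))  # confident, short request
print(bucket_tag(Prediction(60, 0.5), bounds))  # padded past the first boundary
print(bucket_tag(Prediction(50, 0.9), bounds))  # high uncertainty
```

Under this rule a request that is short but poorly predicted is given more headroom than a confidently predicted one, which trades some utilization for fewer under-allocations.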