MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

Yufei Li, Yu Fu, Yue Dong, Cong Liu

TL;DR

MACE addresses the latency-accuracy tradeoff in edge LLM serving by colocating inference and continual retraining on a single GPU and orchestrating them with an iteration-level, memory-aware scheduler. The approach combines alignment-aware prioritization, best-fit memory batching, and cache-aware optimizations (prefix sharing and KV pruning) to co-schedule prefill, decode, and fine-tuning tasks, using LoRA adapters for personalization. Empirical results on real edge and server hardware and diverse datasets show that MACE improves alignment metrics while reducing latency substantially (up to 63% in prefill and 30% in decode) and keeping GPU utilization high (above 85% on Orin). The work advances practical learning-while-serving for edge deployments, with implications for real-time personalization and safety-aligned responses.

Abstract

Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, the non-stationary nature of user data necessitates frequent retraining, which introduces a fundamental tension between inference latency and model accuracy under constrained GPU resources. Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. In this paper, we find that iteration-level scheduling is crucial for adapting retraining frequency to model drift without violating service-level objectives (SLOs). We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while preserving inference throughput. MACE leverages the insight that not all model updates equally affect output alignment and allocates GPU cycles accordingly to balance throughput, latency, and update freshness. Our trace-driven evaluation shows that MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. Compared to periodic retraining, MACE improves the latency breakdown across the prefill, decode, and fine-tuning stages, and sustains GPU utilization above 85% on NVIDIA AGX Orin. These results demonstrate that iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms.
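
The abstract names iteration-level colocation and memory-aware batching but does not spell out the algorithm here, so the following is only an illustrative sketch, not MACE's actual scheduler. It assumes a fixed per-iteration memory budget, admits inference tasks in priority order, and then fills the leftover memory with the fine-tuning candidate that leaves the least slack (a simple best-fit rule). The `Task` fields, the `pack_iteration` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str        # hypothetical label, e.g. "B_d" for a decode step of request B
    kind: str        # "prefill", "decode", or "finetune"
    mem_mb: int      # assumed memory footprint for one iteration
    priority: float  # assumed score; higher is admitted first


def pack_iteration(tasks, budget_mb):
    """Select the tasks to run in one iteration under a memory budget."""
    chosen, free = [], budget_mb

    # Phase 1: admit inference tasks in priority order while they fit.
    inference = sorted(
        (t for t in tasks if t.kind != "finetune"),
        key=lambda t: t.priority,
        reverse=True,
    )
    for t in inference:
        if t.mem_mb <= free:
            chosen.append(t)
            free -= t.mem_mb

    # Phase 2: best-fit fill. Repeatedly pick the fine-tuning candidate
    # that fits and leaves the least free memory behind.
    background = [t for t in tasks if t.kind == "finetune"]
    while True:
        fitting = [t for t in background if t.mem_mb <= free]
        if not fitting:
            break
        best = max(fitting, key=lambda t: t.mem_mb)
        chosen.append(best)
        background.remove(best)
        free -= best.mem_mb

    return chosen, free


tasks = [
    Task("A_ft_small", "finetune", 2000, 0.0),
    Task("A_ft_large", "finetune", 3000, 0.0),
    Task("B_d", "decode", 500, 3.0),
    Task("C_d", "decode", 500, 2.0),
    Task("D_d", "decode", 500, 1.0),
]
chosen, free = pack_iteration(tasks, budget_mb=5000)
print([t.name for t in chosen], free)
```

In this toy run the three decode steps are admitted first, and the larger fine-tuning micro-batch is colocated with them because it uses the remaining memory most fully. A real scheduler would also have to account for SLO deadlines and for the alignment-aware prioritization the paper describes.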

Paper Structure

This paper contains 22 sections, 6 equations, 15 figures, 2 tables, 2 algorithms.

Figures (15)

  • Figure 1: Requests A, B, C, D arrive over time. Subscripts $p$ and $d$ indicate prefill and decode iterations, while $ft$ marks fine-tuning. Periodic retraining delays model updates for A due to inference priority. Sync retraining preempts decodes for B, C, D. The async (hybrid) schedule colocates A$_{ft}$, B$_{d}$, C$_{d}$, D$_{d}$ in the same iteration, reducing latency and ensuring that B, C, D benefit (in subsequent iterations) from the updated model, without stalling either workload.
  • Figure 2: Win rate and CLPD over time when serving Mistral-7B on the RLHF (left) and SHP (right) datasets. The inset bar plots show the average metric value of each method, highlighting the overall performance difference beyond temporal variations.
  • Figure 3: Latency (left) and per-token memory (right) of three workloads for Mistral-7B on A6000 Ada across varying batch sizes.
  • Figure 4: Abstracted memory–latency footprint for three workloads. Hybrid scheduling together with pruning techniques mitigates memory fragmentation and increases concurrency.
  • Figure 5: GPU utilization (left) and latency breakdown (right) on two platforms: (a) an A6000 Ada server and (b) an AGX Orin edge device.
  • ...and 10 more figures