Alchemist: Towards the Design of Efficient Online Continual Learning System

Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang

TL;DR

Alchemist tackles the inefficiency of online continual learning by reusing serving activations during training. It introduces minimal activation recording during prefill and a memory-aware offloader with scheduling and hedging to maintain serving latency and capacity while boosting throughput. Empirical results show up to $1.72\times$ training throughput gains, up to $47\%$ memory reduction, and up to $2\times$ more trainable tokens, with only modest serving overhead. This approach offers a practical path to faster, more scalable online updates for large language models in real-world cloud deployments.

Abstract

Continual learning has become a promising way to refine large language models incrementally by leveraging user feedback. In particular, online continual learning, which iteratively trains the model on small batches of user feedback, has demonstrated notable performance improvements. However, the existing practice of separating training and serving processes forces the online trainer to recompute intermediate results that were already computed during serving. Such redundant computation can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token lengths sampled from the ShareGPT dataset show that, compared with a separate training cluster, Alchemist increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens, all while maintaining negligible impact on serving latency.
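
As a rough consistency check (an illustrative bound under stated assumptions, not a derivation taken from the paper): suppose a fraction $f$ of each training iteration is recomputation that activation reuse removes entirely, and that reuse itself adds no cost. The best-case training speedup is then

$$
S(f) = \frac{1}{1 - f}, \qquad S(0.30) = \frac{1}{0.70} \approx 1.43, \qquad S(0.42) = \frac{1}{0.58} \approx 1.72 .
$$

Under these assumptions, the upper end of the reported 30%-42% redundancy corresponds to a ceiling of about $1.72\times$, which is in line with the reported peak throughput gain.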

Paper Structure

This paper contains 28 sections and 11 figures.

Figures (11)

  • Figure 1: Lifecycle of modern AI services.
  • Figure 2: Activations in forward and backward passes of model training. Activations, $a$, computed during the forward pass are stored and referenced during the backward pass to avoid redundant recomputation. The figure is simplified for illustration and does not fully reflect what actually happens during training.
  • Figure 3: Continual training time breakdown. When continually training on served inputs, each iteration of training with DPO loss spends 30% of the total time, shown in hatched red, recomputing the same activations that were already calculated during serving. For pre-training with cross-entropy loss, this number increases to 43%. This substantial recomputation time motivates reusing the activations that have been calculated during serving.
  • Figure 4: Alchemist system overview. ➊ Alchemist injects preemption hooks to switch from the training context to the serving context upon query arrival. ➋ Alchemist saves activations and other user-specified cacheable data calculated during serving jobs for later training when labels are ready. ➌ Alchemist asynchronously copies serving activations to host memory. ➍ The Alchemist trainer calls the user's customized training function, which pulls activations and labels when they are ready. ➎ Alchemist frees activations when a serving query arrives amid a training job and requires more memory.
  • Figure 5: Latency overhead of activation recording. Due to the cost of recording and saving activations and computation graphs, autograd frameworks like torch.autograd can add 21% overhead to the prefill phase and 35% overhead to each forward pass in the decode phase (i.e., a 35% increase in each token's generation time). If enabled for every generated token, recording would significantly prolong serving latency, violating the latency requirement.
  • ...and 6 more figures
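
The reuse idea behind Figures 2 and 3 can be illustrated with a minimal, standard-library Python sketch. It is not Alchemist's implementation: the scalar tanh network, the squared-error loss, and the names `ActivationStore`, `serve`, `train_recompute`, and `train_reuse` are all hypothetical, and the sketch assumes the weights do not change between serving a request and training on it.

```python
import math


def forward(weights, x, record=None):
    """Scalar tanh MLP: a_i = tanh(w_i * a_{i-1}). Optionally record each a_i."""
    a = x
    for w in weights:
        a = math.tanh(w * a)
        if record is not None:
            record.append(a)
    return a


def backward(weights, x, activations, target):
    """Gradients of 0.5 * (output - target)^2 w.r.t. each weight,
    computed from saved activations without running forward again."""
    layer_inputs = [x] + activations[:-1]
    grad_out = activations[-1] - target
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        dz = grad_out * (1.0 - activations[i] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grads[i] = dz * layer_inputs[i]
        grad_out = dz * weights[i]
    return grads


class ActivationStore:
    """Hypothetical store mapping a request id to activations saved while serving."""

    def __init__(self):
        self._saved = {}

    def put(self, request_id, activations):
        self._saved[request_id] = list(activations)

    def pop(self, request_id):
        return self._saved.pop(request_id)


def serve(weights, x, request_id, store):
    """Answer a query and keep its activations for later training."""
    record = []
    y = forward(weights, x, record)
    store.put(request_id, record)
    return y


def train_recompute(weights, x, target):
    """Separate-trainer baseline: repeats the forward pass that serving already ran."""
    record = []
    forward(weights, x, record)
    return backward(weights, x, record, target)


def train_reuse(weights, x, target, request_id, store):
    """Reuse path: pulls the serving activations and goes straight to backward."""
    return backward(weights, x, store.pop(request_id), target)


weights = [0.5, -1.2, 0.8]
store = ActivationStore()
response = serve(weights, 0.3, "req-1", store)
grads_recompute = train_recompute(weights, 0.3, target=1.0)
grads_reuse = train_reuse(weights, 0.3, 1.0, "req-1", store)
print(grads_reuse == grads_recompute)
```

Both paths produce the same gradients; the reuse path simply skips one forward pass per sample, which is the portion of training time that Figure 3 attributes to recomputation.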