Alchemist: Towards the Design of Efficient Online Continual Learning System

Yuyang Huang; Yuhan Liu; Haryadi S. Gunawi; Beibin Li; Changho Hwang

TL;DR

Alchemist tackles the inefficiency of online continual learning by reusing serving activations during training. It introduces minimal activation recording during prefill and a memory-aware offloader with scheduling and hedging to maintain serving latency and capacity while boosting throughput. Empirical results show up to $1.72\times$ training throughput gains, up to $47\%$ memory reduction, and up to $2\times$ more trainable tokens, with only modest serving overhead. This approach offers a practical path to faster, more scalable online updates for large language models in real-world cloud deployments.

Abstract

Continual learning has become a promising way to refine large language models incrementally by leveraging user feedback. In particular, online continual learning, which iteratively trains the model on small batches of user feedback, has demonstrated notable performance improvements. However, the existing practice of separating the training and serving processes forces the online trainer to recompute intermediate results that were already computed during serving. Such redundant computation can account for 30%-42% of total training time. In this paper, we propose Alchemist, which is, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token lengths sampled from the ShareGPT dataset show that, compared with a separate training cluster, Alchemist increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens, all while maintaining negligible impact on serving latency.
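
As a rough sanity check on these figures (not a calculation taken from the paper), the sketch below computes the ideal throughput gain if the redundant share of training time were removed entirely at no extra cost. The function name `ideal_speedup` and the zero-overhead assumption are illustrative only; real gains also depend on offloading and transfer costs.

```python
def ideal_speedup(redundant_fraction: float) -> float:
    """Upper-bound speedup when a fraction of training time is eliminated.

    Assumption (ours, not the paper's): the redundant work disappears
    completely and reusing serving activations adds no new cost.
    """
    if not 0.0 <= redundant_fraction < 1.0:
        raise ValueError("redundant_fraction must be in [0, 1)")
    return 1.0 / (1.0 - redundant_fraction)


# The abstract reports that redundant computation is 30%-42% of training time.
for fraction in (0.30, 0.42):
    print(f"redundant share {fraction:.0%} -> at most {ideal_speedup(fraction):.2f}x")
```

Under this simplification, a 42% redundant share bounds the gain at about 1.72x, which is in line with the peak throughput improvement reported above.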
