Vmem: A Lightweight Hot-Upgradable Memory Management for In-production Cloud Environment
Hao Zheng, Qiang Wang, Longxiang Wang, Xishi Qiu, Yibin Shen, Xiaoshe Dong, Naixuan Guan, Jia Wei, Fudong Qiu, Xingjun Zhang, Yun Xu, Mao Zhao, Yisheng Xie, Shenglong Zhao, Min He, Yu Li, Xiao Zheng, Ben Luo, Jiesheng Wu
TL;DR
Vmem is a modular, hot-upgradable memory management architecture for in-production clouds. It decouples VM memory from the host OS through elastic reserved memory, slice-based allocation, and FastMap bidirectional translation, which reduces metadata overhead while enabling rapid elasticity and online upgrades. The system delivers near-Hugetlb performance, raises the sellable memory rate by about 2%, and boots VFIO-based VMs over 3x faster. It has run in production for seven years on over 300,000 servers hosting hundreds of millions of VMs. The work also reports commercial benefits, including tens of gigabytes of reclaimed host memory per server and multi-hundred-petabyte-scale potential, along with portability and a clear path for extension and integration into cloud stacks.
Abstract
Traditional memory management suffers from metadata overhead, architectural complexity, and stability degradation, problems intensified in cloud environments. Existing software/hardware optimizations are insufficient for cloud computing's dual demands of flexibility and low overhead. This paper presents Vmem, a memory management architecture for in-production cloud environments that enables flexible, efficient cloud server memory utilization through lightweight reserved memory management. Vmem is the first such architecture to support online upgrades, meeting cloud requirements for high stability and rapid iterative evolution. Experiments show Vmem increases sellable memory rate by about 2%, delivers extreme elasticity and performance, achieves over 3x faster boot time for VFIO-based virtual machines (VMs), and improves network performance by about 10% for DPU-accelerated VMs. Vmem has been deployed at large scale for seven years, demonstrating efficiency and stability on over 300,000 cloud servers supporting hundreds of millions of VMs.
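To make the metadata-overhead argument concrete, the following is a rough back-of-the-envelope sketch, not Vmem's actual accounting. It assumes a 64-byte descriptor per 4 KiB page (typical of `struct page` on x86-64 Linux), a hypothetical 384 GiB host, and a hypothetical 2 MiB tracking unit with the same descriptor size. None of these figures come from the paper, and the helper name `metadata_overhead` is illustrative only.

```python
def metadata_overhead(total_bytes, unit_bytes, meta_bytes_per_unit):
    """Return (metadata bytes, fraction of total) when memory is tracked
    in fixed-size units, each with a fixed-size descriptor."""
    units = total_bytes // unit_bytes
    meta = units * meta_bytes_per_unit
    return meta, meta / total_bytes

GiB = 1024 ** 3
MiB = 1024 ** 2

host = 384 * GiB  # hypothetical host memory size (assumption)

# Assumption: one 64-byte descriptor per 4 KiB page.
page_meta, page_frac = metadata_overhead(host, 4096, 64)

# Assumption: one 64-byte descriptor per 2 MiB unit (coarser granularity).
slice_meta, slice_frac = metadata_overhead(host, 2 * MiB, 64)

print(f"per-page metadata:  {page_meta / GiB:.2f} GiB ({page_frac:.4%})")
print(f"per-slice metadata: {slice_meta / MiB:.2f} MiB ({slice_frac:.6%})")
```

Under these assumptions, per-page tracking costs 1/64 of host memory (about 1.56%), which is the same order of magnitude as the roughly 2% gain in sellable memory rate reported above. Coarser tracking units shrink that cost by the ratio of the unit sizes.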
