FastMig: Leveraging FastFreeze to Establish Robust Service Liquidity in Cloud 2.0
Sorawit Manatura, Thanawat Chanikaphon, Chantana Chantrapornchai, Mohsen Amini Salehi
TL;DR
The paper addresses the challenge of achieving robust service liquidity for Cloud 2.0 by enabling low-downtime live migration of containerized services across edge-to-cloud and multi-cloud environments. It proposes FastMig, a platform that augments the FastFreeze checkpoint/restore workflow with a decoupled service-management layer, a fault-tolerance mechanism, and a warm restoration technique, all accessible via an HTTP API. Key contributions include the FastFreeze Daemon, a configurable restart policy, improved restoration for multi-process services, an extension that supports warm restoration, and a comprehensive evaluation showing substantial downtime reductions with minimal overhead. The work demonstrates practical impact for federated learning, edge-cloud mobility, and dynamic resource management by enabling reliable, low-latency service relocation across distributed environments.
Abstract
Service liquidity across edge-to-cloud or multi-cloud will serve as the cornerstone of the next generation of cloud computing systems (Cloud 2.0). Given that cloud-based services are predominantly containerized, an efficient and robust live container migration solution is required to accomplish service liquidity. To meet this growing requirement, we leverage FastFreeze, a popular platform for process checkpoint/restore within a container, and promote it to a robust solution for end-to-end live migration of containerized services. In particular, we develop a new platform, called FastMig, that proactively controls the checkpoint/restore operations of FastFreeze, thereby allowing for robust live migration of containerized services via standard HTTP interfaces. The proposed platform introduces post-checkpointing and pre-restoration operations to enhance migration robustness. Notably, the pre-restoration operation includes containerized service startup options, enabling warm restoration and reducing the migration downtime. In addition, we develop a method to make FastFreeze robust against failures that commonly happen during the migration and even during the normal operation of a containerized service. Experimental results under real-world settings show that the migration downtime of a containerized service can be reduced by a factor of 30 compared to the situation where the original FastFreeze is deployed for the migration. Moreover, we demonstrate that FastMig and the warm restoration method together can significantly mitigate the container startup overhead. Importantly, these improvements are achieved without any significant performance reduction and incur only a small resource usage overhead compared to the bare (i.e., non-FastFreeze) containerized services.
