Cloud Uptime Archive: Open-Access Availability Data of Web, Cloud, and Gaming Services
Sacheendra Talluri, Dante Niewenhuis, Xiaoyu Chu, Jakob Kyselica, Mehmet Cetin, Alexander Balgavy, Alexandru Iosup
TL;DR
This paper introduces the Cloud Uptime Archive (CUA), an open, multi-vantage-point repository of cloud, web, and online gaming uptime data designed to standardize reliability evaluation. It details data collection from operator pages, crowdsourced reports, and activity metrics, plus automated (LLM-assisted) and manual failure/explanation extraction, culminating in a normalized trace format for easy reuse in experiments. Through MTBF/MTTR analyses, time-pattern studies, and severity/duration correlations, the work demonstrates distinct biases between operator and user reports and provides a unified severity framework for cross-source comparisons. The authors validate data coherence across sources, compare against established datasets, and showcase practical utility by using traces to evaluate checkpointing and retry strategies in HPC and service-based applications via trace-driven simulation. All data and tooling are released openly, enabling broader use in reliability research and fault-tolerance design.
Abstract
Cloud services are critical to society. However, their reliability is poorly understood. Towards solving this problem, we propose a standard repository for cloud uptime data. We populate this repository with data we collect, containing failure reports from users and operators of cloud services, web services, and online games. The multiple vantage points help reduce bias from individual users and operators. We compare our new data to existing failure data from the Failure Trace Archive and the Google cluster trace. We analyze the MTBF and MTTR, time patterns, failure severity, user-reported symptoms, and operator-reported symptoms of failures in the data we collect. We observe that high-level, user-facing services fail less often than low-level infrastructure services, likely because they use fault-tolerance techniques. We use simulation-based experiments to demonstrate the impact of different failure traces on the performance of checkpointing and retry mechanisms. We release the data, and the analysis and simulation tools, as open-source artifacts available at https://github.com/atlarge-research/cloud-uptime-archive .
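
As an illustration of the MTBF and MTTR metrics named above, the following minimal sketch computes both from a list of failure intervals for a single service. It rests on assumptions not stated in the abstract: MTBF is taken as the mean time between the starts of consecutive failures, MTTR as the mean failure duration, and the function name `mtbf_mttr` and the `(start, end)` input layout are hypothetical, not the archive's actual trace format or tooling.

```python
from datetime import datetime
from statistics import mean


def mtbf_mttr(failures):
    """Return (mtbf_seconds, mttr_seconds) for one service.

    `failures` is a list of (start, end) datetime pairs (hypothetical layout).
    Assumed definitions:
      MTBF = mean time between the starts of consecutive failures
      MTTR = mean duration of a failure
    MTBF is None when fewer than two failures are given.
    """
    if not failures:
        raise ValueError("need at least one failure interval")
    ordered = sorted(failures)  # sort by start time
    repair_times = [(end - start).total_seconds() for start, end in ordered]
    inter_arrivals = [
        (nxt[0] - cur[0]).total_seconds()
        for cur, nxt in zip(ordered, ordered[1:])
    ]
    mtbf = mean(inter_arrivals) if inter_arrivals else None
    return mtbf, mean(repair_times)


# Made-up example intervals, for illustration only.
example = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 0, 30)),
    (datetime(2024, 1, 4, 0, 0), datetime(2024, 1, 4, 2, 0)),
]
mtbf, mttr = mtbf_mttr(example)
print(f"MTBF: {mtbf / 3600:.1f} h, MTTR: {mttr / 60:.1f} min")
```

Other definitions of MTBF exist, for example the mean uptime between the end of one failure and the start of the next; the choice affects the reported values and should be matched to the definition used in the data being analyzed.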
