Crash-Consistent Checkpointing for AI Training on macOS/APFS
Juha Jeon
TL;DR
This work evaluates crash-consistent checkpointing for AI training on macOS/APFS. It implements three durability modes (unsafe, atomic_nodirsync, atomic_dirsync) and a multi-file manifest-commit protocol, paired with a format-agnostic integrity guard, and uses a fault-injection harness to quantify crash resilience and corruption-detection effectiveness. Unsafe writes lose all checkpoints under crashes, the atomic modes maintain 100% crash consistency, and the integrity guard detects 99.8–100% of corruptions with zero false positives. Relative to unsafe, the study reports overheads of 56.5% for atomic_nodirsync and 84.2% (median) for atomic_dirsync, with tail overhead up to 570.6%; latency nevertheless remains below 20 ms per checkpoint in most cases, which makes the approach practical for many training intervals. The results provide deployment guidance on balancing durability against performance and demonstrate that cross-layer observability and defense-in-depth are essential for reliable AI infrastructure. The work also outlines future directions: cross-filesystem validation, real-world workloads, integration with modern checkpoint systems, and formal modeling to generalize the guarantees beyond APFS.
Abstract
Deep learning training relies on periodic checkpoints to recover from failures, but unsafe checkpoint installation can leave corrupted files on disk. This paper presents an experimental study of checkpoint installation protocols and integrity validation for AI training on macOS/APFS. We implement three write modes with increasing durability guarantees: unsafe (baseline, no fsync()), atomic_nodirsync (file-level durability via fsync()), and atomic_dirsync (file and directory durability). We design a format-agnostic integrity guard that uses SHA-256 checksums with automatic rollback. Through controlled experiments including crash injection (430 unsafe-mode trials) and corruption injection (1,600 atomic-mode trials), we demonstrate that the integrity guard detects 99.8–100% of corruptions with zero false positives. Performance overhead is 56.5–108.4% for atomic_nodirsync and 84.2–570.6% for atomic_dirsync relative to the unsafe baseline. Our findings quantify the reliability-performance trade-offs and provide deployment guidance for production AI infrastructure.
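The three write modes and the checksum guard described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`write_checkpoint`, `save_with_digest`, `is_valid`, `latest_valid`), the write-to-temp-then-rename pattern, and the `.sha256` sidecar file are assumptions chosen for illustration, and rollback is modeled simply as falling back to the newest checkpoint that still verifies.

```python
import hashlib
import os
import tempfile


def write_checkpoint(path, data, mode="atomic_dirsync"):
    """Install `data` at `path` using one of the three durability modes."""
    if mode == "unsafe":
        # Baseline: overwrite in place with no fsync; a crash may leave a torn file.
        with open(path, "wb") as f:
            f.write(data)
        return
    if mode not in ("atomic_nodirsync", "atomic_dirsync"):
        raise ValueError(f"unknown mode: {mode}")

    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file-level durability
        os.replace(tmp, path)     # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

    if mode == "atomic_dirsync":
        # Also sync the directory so the rename itself is durable.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def save_with_digest(path, data, mode="atomic_dirsync"):
    """Write the checkpoint, then a sidecar file holding its SHA-256 digest."""
    write_checkpoint(path, data, mode)
    digest = hashlib.sha256(data).hexdigest().encode()
    write_checkpoint(path + ".sha256", digest, mode)


def is_valid(path):
    """Format-agnostic check: recompute SHA-256 over raw bytes and compare."""
    try:
        with open(path + ".sha256", "rb") as f:
            expected = f.read().decode().strip()
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == expected
    except (OSError, UnicodeDecodeError):
        return False


def latest_valid(paths):
    """Rollback: given paths ordered oldest to newest, return the newest valid one."""
    for p in reversed(paths):
        if is_valid(p):
            return p
    return None
```

A training loop built on this sketch would call `save_with_digest` at each checkpoint interval and `latest_valid` on restart. Because the guard hashes raw bytes, it does not depend on the checkpoint's serialization format.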
