Crash-Consistent Checkpointing for AI Training on macOS/APFS

Juha Jeon

TL;DR

This work evaluates crash-consistent checkpointing for AI training on macOS/APFS. It implements three durability modes (unsafe, atomic_nodirsync, atomic_dirsync) and a multi-file manifest-commit protocol, paired with a format-agnostic integrity guard. A fault-injection harness quantifies crash resilience and corruption-detection effectiveness: unsafe writes lose all checkpoints under crashes, the atomic modes maintain 100% crash consistency, and the integrity guard detects 99.8–100% of corruptions with zero false positives. The study reports overheads of 56.5% for atomic_nodirsync and 84.2% (median) for atomic_dirsync relative to unsafe, with tail overhead up to 570.6%. Latency nevertheless remains below 20 ms per checkpoint in most cases, which makes the approach practical for many training intervals. The results provide deployment guidance on balancing durability against performance, and they demonstrate that cross-layer observability and defense-in-depth are essential for reliable AI infrastructure. The work also outlines future directions: cross-filesystem validation, real-world workloads, integration with modern checkpoint systems, and formal modeling to generalize the guarantees beyond APFS.
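
The three durability modes can be illustrated with a short sketch. It is not the paper's implementation: it assumes the atomic modes follow the conventional pattern of writing a temporary file, calling fsync, and renaming it over the target. The function name write_checkpoint and the temporary-file naming are hypothetical.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes, mode: str = "atomic_dirsync") -> None:
    """Install `data` at `path` using one of three durability modes (POSIX only)."""
    if mode == "unsafe":
        # Overwrite in place with no fsync: a crash can leave a torn file.
        with open(path, "wb") as f:
            f.write(data)
        return
    if mode not in ("atomic_nodirsync", "atomic_dirsync"):
        raise ValueError(f"unknown mode: {mode}")

    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays
    # within one filesystem and can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file-level durability
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    if mode == "atomic_dirsync":
        # Also fsync the directory so the rename itself is durable.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

One caveat on macOS: Apple documents that fsync() does not necessarily force the drive to flush its own write cache, and provides the F_FULLFSYNC fcntl for that purpose. The sketch uses plain fsync(); whether the study uses F_FULLFSYNC is not stated in this summary.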

Abstract

Deep learning training relies on periodic checkpoints to recover from failures, but unsafe checkpoint installation can leave corrupted files on disk. This paper presents an experimental study of checkpoint installation protocols and integrity validation for AI training on macOS/APFS. We implement three write modes with increasing durability guarantees: unsafe (baseline, no fsync), atomic_nodirsync (file-level durability via fsync()), and atomic_dirsync (file + directory durability). We design a format-agnostic integrity guard using SHA-256 checksums with automatic rollback. Through controlled experiments including crash injection (430 unsafe-mode trials) and corruption injection (1,600 atomic-mode trials), we demonstrate that the integrity guard detects 99.8–100% of corruptions with zero false positives. Performance overhead is 56.5–108.4% for atomic_nodirsync and 84.2–570.6% for atomic_dirsync relative to the unsafe baseline. Our findings quantify the reliability-performance trade-offs and provide deployment guidance for production AI infrastructure.
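
A format-agnostic guard of the kind described above can be sketched by hashing raw bytes, so it never needs to parse the checkpoint format. The following is a minimal illustration under stated assumptions, not the paper's design: the `<path>.sha256` sidecar layout and the names sha256_file, write_digest, is_valid, and load_latest_valid are hypothetical, and "rollback" is modeled simply as falling back to the newest older checkpoint that still verifies.

```python
import hashlib
import os
from typing import List, Optional


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's raw bytes in chunks; no knowledge of the format is needed."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_digest(path: str) -> None:
    """Record the checkpoint's digest in a (hypothetical) sidecar file."""
    with open(path + ".sha256", "w") as f:
        f.write(sha256_file(path))


def is_valid(path: str) -> bool:
    """True only if both files are readable and the digests match."""
    try:
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        return sha256_file(path) == expected
    except OSError:
        # A missing or unreadable checkpoint or sidecar counts as invalid.
        return False


def load_latest_valid(paths: List[str]) -> Optional[str]:
    """Roll back: return the newest checkpoint that verifies, else None.

    `paths` is assumed to be ordered oldest to newest.
    """
    for path in reversed(paths):
        if is_valid(path):
            return path
    return None
```

At restore time a trainer would call load_latest_valid on its checkpoint list and resume from the returned path. In a real deployment the sidecar itself would need to be installed with the same durability discipline as the checkpoint it describes.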

Paper Structure

This paper contains 60 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Per-group checkpoint latency percentiles (p50/p90/p99) for each write protocol.
  • Figure 2: CDF of group-checkpoint latency for each protocol.
  • Figure 3: Group atomicity under crash injection. Bars show the fraction of checkpoint groups that remain usable after a crash (95% Wilson CIs).
  • Figure 4: Failure reason breakdown for unsafe mode under crash injection.
  • Figure 5: Corruption detection rates with 95% CIs for each fault type (atomic writes only).
  • ...and 1 more figure