Why Atomicity Matters to AI/ML Infrastructure: Snapshots, Firmware Updates, and the Cost of the Forward-In-Time-Only Category Mistake

Paul Borrill

TL;DR

This work models checkpoint execution in a process-algebraic framework and proves that, under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. It also sketches a bilateral convergence protocol, inspired by Open Atomic Ethernet, that achieves convergence without requiring a temporal snapshot, replacing the Forward-In-Time-Only (FITO) assumption with constraint semantics.

Abstract

Large-scale AI/ML training systems depend on two assumptions that are rarely examined: (1) that checkpoints represent atomic snapshots of global training state, and (2) that infrastructure updates can be applied without inducing mixed-protocol cluster states. Both assumptions are instances of a deeper structural error: the Forward-In-Time-Only (FITO) category mistake, which confuses protocol convergence properties with temporal predicates. We formalize this confusion as a type error: the identification of a temporal snapshot $\mathsf{Snap}(t)$ with a convergence property $\mathsf{Conv}(\mathcal{P},e)$. We model checkpoint execution in a process-algebraic framework and prove that under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. We reformulate checkpoint inconsistency on an epoch lattice and show that atomicity is a measure-zero event whose complement grows exponentially with the number of independent persistence domains. We formalize mixed-epoch recovery as a type violation in the optimization algebra and show that the resulting update is not a valid step of any standard optimizer. For firmware fleet updates, we strengthen the known consensus-hardness result: atomic deployment requires not merely agreement but common knowledge of the epoch transition, which is strictly unattainable in asynchronous systems with unreliable communication. We conclude by sketching a bilateral convergence protocol, inspired by Open Atomic Ethernet, that achieves $\mathsf{Conv}(\mathcal{P},e)$ without requiring $\mathsf{Snap}(t)$ -- replacing the FITO assumption with constraint semantics.
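To give a feel for why the non-atomic outcome comes to dominate as persistence domains are added, here is a minimal sketch in Python. It is not the paper's epoch-lattice construction. It assumes a much simpler toy model in which each of `n_domains` independent persistence domains has, at the moment of a crash, durably committed the new epoch with the same probability `p_new`. The function name `mixed_epoch_probability` and both parameters are hypothetical and chosen only for illustration.

```python
def mixed_epoch_probability(n_domains: int, p_new: float) -> float:
    """Probability that a crash leaves the domains disagreeing on the epoch.

    Toy assumption (not taken from the paper): each domain has committed
    the new epoch independently with probability p_new.
    """
    if n_domains < 1:
        raise ValueError("n_domains must be at least 1")
    if not 0.0 <= p_new <= 1.0:
        raise ValueError("p_new must lie in [0, 1]")

    # The only consistent outcomes: every domain holds the new epoch,
    # or every domain still holds the old one.
    all_new = p_new ** n_domains
    all_old = (1.0 - p_new) ** n_domains

    # Everything else is a mixed-epoch checkpoint.
    return 1.0 - (all_new + all_old)


if __name__ == "__main__":
    for n in (1, 2, 8, 64):
        print(n, mixed_epoch_probability(n, 0.5))
```

In this toy model the consistent outcomes have total probability `p_new**n + (1 - p_new)**n`, which shrinks geometrically in `n`, so the mixed-epoch probability approaches 1 as domains are added. The paper's own result is stated on an epoch lattice and may differ in its details.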

Paper Structure (34 sections, 11 theorems, 23 equations)

Introduction
The FITO Category Mistake
System Model and Definitions
    Training state as distributed product type
    Checkpoint as process composition
    Failure model
    Trace property vs. state predicate
Non-Existence of a Temporal Snapshot Boundary
Checkpoint Inconsistency and the Epoch Lattice
    The epoch lattice
    Atomicity as a measure-zero event
    The ternary model with epistemic ambiguity
    Concrete parameters
Semantic Causality as Type Violation
    The optimization type system
...and 19 more sections

Key Result

Proposition 2.5

The identification is a category mistake. $\mathsf{Snap}(t,e)$ is a temporal predicate (logical type: proposition about a time point). $\mathsf{Conv}(\mathcal{P},e)$ is a protocol property (logical type: proposition about a computation history). The two belong to different logical types and cannot be identified withou

Theorems & Definitions (47)

Definition 2.1: Category mistake (Ryle)
Definition 2.2: FITO ordering [borrill_message_passing]
Definition 2.3: Temporal snapshot predicate
Definition 2.4: Protocol convergence property
Proposition 2.5: The FITO category mistake in checkpointing
proof
Remark 2.6
Definition 3.1: Persistence process
Definition 3.2: Checkpoint protocol
Proposition 3.3: Committed state is a trace property
...and 37 more
