Validating Traces of Distributed Programs Against TLA+ Specifications

Horatiu Cirstea; Markus A. Kuppe; Benjamin Loillier; Stephan Merz

Validating Traces of Distributed Programs Against TLA+ Specifications

Horatiu Cirstea, Markus A. Kuppe, Benjamin Loillier, Stephan Merz

TL;DR

The paper addresses the challenge of validating distributed program executions against high-level TLA+ specifications by instrumenting implementations to produce JSON traces and by constraining TLA+ specifications with a trace-aware TLC workflow. The approach combines TLATracer instrumentation, an extended TraceSpec module, and constrained model checking to determine trace compatibility, enabling detection of discrepancies without full formal proofs. It demonstrates this on multiple case studies, including Two-Phase Commit, a KV store, Raft-inspired systems, and the Confidential Consortium Framework, showing practical bug discovery and guidance for instrumentation. The work offers industry-grade tooling and workflows for integrating trace validation into development and CI processes, improving confidence in distributed system implementations.

Abstract

TLA+ is a formal language for specifying systems, including distributed algorithms, that is supported by powerful verification tools. In this work we present a framework for relating traces of distributed programs to high-level specifications written in TLA+. The problem is reduced to a constrained model checking problem, realized using the TLC model checker. Our framework consists of an API for instrumenting Java programs in order to record traces of executions, of a collection of TLA+ operators that are used for relating those traces to specifications, and of scripts for running the model checker. Crucially, traces only contain updates to specification variables rather than full values, and developers may choose to trace only certain variables. We have applied our approach to several distributed programs, detecting discrepancies between the specifications and the implementations in all cases. We discuss reasons for these discrepancies, best practices for instrumenting programs, and how to interpret the verdict produced by TLC.

Validating Traces of Distributed Programs Against TLA+ Specifications

TL;DR

Abstract

Paper Structure (18 sections, 9 figures)

This paper contains 18 sections, 9 figures.

Introduction
Background
TLA+ Specifications
Java Implementation
Instrumenting Distributed Programs
Checking Program Traces Against TLA+ Specifications
The trace validation problem
Realizing trace validation using the TLC model checker
Analyzing discrepancies
Experience with trace validation
Implementation aspects
Related Work
Conclusion
Specification of the Two-Phase Commit protocol
Implementation of the Two-Phase Commit protocol
...and 3 more sections

Figures (9)

Figure 1: Overview of trace validation.
Figure 2: TLA+ Specification of Two-Phase Commit.
Figure 3: Java implementation of the RM of the Two-Phase Commit protocol.
Figure 4: Trace validation as a search for paths in the state space.
Figure 5: Number of distinct states explored for valid traces of Two-Phase Commit (TP) and Key-Value Store (KV), for various degrees of precision (1hr timeout).
...and 4 more figures

Validating Traces of Distributed Programs Against TLA+ Specifications

TL;DR

Abstract

Validating Traces of Distributed Programs Against TLA+ Specifications

Authors

TL;DR

Abstract

Table of Contents

Figures (9)