Verde: Verification via Refereed Delegation for Machine Learning Programs

Arasu Arun, Adam St. Arnaud, Alexey Titov, Brian Wilcox, Viktor Kolobaric, Marc Brinkmann, Oguzhan Ersoy, Ben Fielding, Joseph Bonneau

TL;DR

Verde provides correctness guarantees for ML programs delegated to untrusted compute providers through a refereed-delegation framework tailored to neural networks. It pairs a dispute arbitration protocol (also named Verde) with RepOps, a library for bitwise reproducibility across hardware, and reports practical overheads for compute providers. The two-phase protocol first localizes a divergence to a training step and then to a specific operator in the computational graph, while RepOps eliminates hardware-induced nondeterminism by enforcing a fixed order of floating-point operations. The work is implemented and evaluated in PyTorch/ONNX with CUDA, showing that the resulting guarantees are feasible for inference and training workloads and offering a path to auditing and blockchain-based deployments.
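The first localization phase can be illustrated with a small sketch. The summary above does not say how the disputed step is located, so binary search over per-step checkpoint hashes is an assumption here, and the names `run` and `locate_disputed_step` are hypothetical.

```python
import hashlib
from typing import List, Optional, Sequence


def run(steps: int, cheat_at: Optional[int] = None) -> List[str]:
    """Toy 'training run': each step hashes the previous state.

    Returns one hash per checkpoint, index 0 being the agreed initial state.
    If cheat_at is set, that step is computed incorrectly.
    """
    state = hashlib.sha256(b"init").hexdigest()
    hashes = [state]
    for i in range(1, steps + 1):
        tweak = "!" if i == cheat_at else ""
        state = hashlib.sha256((state + str(i) + tweak).encode()).hexdigest()
        hashes.append(state)
    return hashes


def locate_disputed_step(a: Sequence[str], b: Sequence[str]) -> int:
    """Return an index i such that a[i-1] == b[i-1] and a[i] != b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length hash lists with at least two entries")
    if a[0] != b[0]:
        raise ValueError("parties must agree on the initial state")
    if a[-1] == b[-1]:
        raise ValueError("no dispute: final states match")
    lo, hi = 0, len(a) - 1  # invariant: agree at lo, disagree at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if a[mid] == b[mid]:
            lo = mid
        else:
            hi = mid
    return hi


honest = run(10)
cheater = run(10, cheat_at=6)
disputed = locate_disputed_step(honest, cheater)
```

In this toy setup the referee compares only a logarithmic number of hashes and would then need to re-examine a single step, which is where an operator-level second phase would take over.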

Abstract

Machine learning programs, such as those performing inference, fine-tuning, and training of LLMs, are commonly delegated to untrusted compute providers. To provide correctness guarantees for the client, we propose adapting the cryptographic notion of refereed delegation to the machine learning setting. This approach enables a computationally limited client to delegate a program to multiple untrusted compute providers, with a guarantee of obtaining the correct result if at least one of them is honest. Refereed delegation of ML programs poses two technical hurdles: (1) an arbitration protocol to resolve disputes when compute providers disagree on the output, and (2) the ability to bitwise reproduce ML programs across different hardware setups. For (1), we design Verde, a dispute arbitration protocol that efficiently handles the large scale and graph-based computational model of modern ML programs. For (2), we build RepOps (Reproducible Operators), a library that eliminates hardware "non-determinism" by controlling the order of floating-point operations performed on all hardware. Our implementation shows that refereed delegation achieves both strong guarantees for clients and practical overheads for compute providers.
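Why the order of floating-point operations matters can be seen in a few lines of plain Python. This is a toy illustration, not RepOps itself: `chunked_sum` is a hypothetical stand-in for a parallel reduction whose partitioning depends on the hardware, and `fixed_order_sum` stands in for a reduction whose order is pinned in advance.

```python
import struct
from typing import List


def sum_left_to_right(values: List[float]) -> float:
    """Add values strictly in list order."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values: List[float], num_chunks: int) -> float:
    """Simulate a parallel reduction: sum contiguous chunks, then the partials."""
    size = -(-len(values) // num_chunks)  # ceiling division
    partials = [
        sum_left_to_right(values[i:i + size])
        for i in range(0, len(values), size)
    ]
    return sum_left_to_right(partials)


def fixed_order_sum(values: List[float]) -> float:
    """Reduction with one canonical order, independent of any 'device' setting."""
    return sum_left_to_right(values)


def bits(x: float) -> bytes:
    """Raw IEEE 754 double encoding, for bitwise comparison."""
    return struct.pack(">d", x)


values = [1e16, 1.0, -1e16, 1.0]

one_chunk = chunked_sum(values, 1)   # same inputs, different partitioning
two_chunks = chunked_sum(values, 2)

device_a = fixed_order_sum(values)
device_b = fixed_order_sum(values)
```

Floating-point addition is not associative, so the two partitionings give different results for the same inputs, while the pinned order is bitwise identical on every run. Agreement at the bit level is what lets compute providers be compared by hash.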

Paper Structure

This paper contains 18 sections, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: Extended computational graph for a neural network with a single operator. Yellow represents nodes that initialize tensor values from either the training data or a training checkpoint. Blue nodes are forward pass operators, and red ones are backward pass operators. For clarity, we label the edge transferring context (also called "saved tensors" in autograd) from the forward pass to the corresponding backward pass operator. In the dispute resolution algorithm, these nodes are specified as AugmentedCGNode objects.
  • Figure 2: The nodes of the computational graph of the latest training step serve as the checkpoint used in Phase 1. The checkpoint is committed to using a Merkle (binary hash) tree and verified by the referee in Phase 2 (line 7). Merkle trees provide efficient proofs of membership for their leaves, facilitating efficient dispute resolution when trainers disagree on the values from the weights, optimizer state, or training data.
  • Figure 3: RepOps overhead for matrix-matrix multiplication.
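The Merkle commitment described for Figure 2 can be sketched as follows. This is a minimal, generic binary hash tree under stated assumptions: SHA-256 as the hash, one-byte domain-separation prefixes, and duplication of the last node on odd-sized levels are choices made for illustration, not details taken from the paper. The function names are hypothetical.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, leaf hashes first, root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
            levels[-1] = level
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[bytes]:
    """Collect the sibling hash at every level below the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify(root: bytes, leaf: bytes, index: int, proof: List[bytes]) -> bool:
    """Recompute the root from a leaf and its sibling path."""
    node = h(b"\x00" + leaf)
    for sibling in proof:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == root


# Placeholder leaves standing in for serialized checkpoint components.
leaves = [b"weights", b"optimizer_state", b"training_batch"]
levels = build_levels(leaves)
root = levels[-1][0]

proof = prove(levels, 1)
ok = verify(root, b"optimizer_state", 1, proof)
tampered = verify(root, b"optimizer_state_modified", 1, proof)
```

The proof holds one hash per tree level, so a referee can check a single disputed value against the committed root without receiving the whole checkpoint.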