Machine Learning Models Have a Supply Chain Problem
Sarah Meiklejohn, Hayden Blauzvern, Mihai Maruseac, Spencer Schrock, Laurent Simon, Ilia Shumailov
TL;DR
The paper identifies substantial supply-chain risks in the open ML model ecosystem, including model tampering and questionable data provenance, and argues for verifiable transparency. It presents two practical defenses: model signing, which binds published models to their publishers, and a verifiable dataset framework that uses verifiable random functions (VRFs), Merkle-tree accumulators, and zero-knowledge sets to prove data inclusion without revealing sensitive data. A Python implementation demonstrates the feasibility of model signing and analyzes hashing costs for large models, and the dataset-verifiability design is evaluated for performance and scalability on large datasets. Together, these steps aim to increase trust for users and regulators by enabling verifiable integrity and provenance in open ML model hubs, though the authors acknowledge that much work remains before broad adoption and robust guarantees are achieved.
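To make the Merkle-tree ingredient concrete, here is a minimal sketch of a plain Merkle inclusion proof over dataset records. It is an illustration only and not the paper's construction: the function names (`build_levels`, `prove`, `verify`) are hypothetical, the leaf and node prefixes are an assumption borrowed from common practice, and a plain Merkle tree provides none of the privacy that the VRF and zero-knowledge-set components are meant to add.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(record: bytes) -> bytes:
    # Distinct prefixes keep leaf hashes and interior-node hashes from colliding.
    return _h(b"\x00" + record)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(records: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first; the last level holds only the root."""
    if not records:
        raise ValueError("need at least one record")
    level = [leaf_hash(r) for r in records]
    levels = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Simplification for this sketch: pad odd levels by repeating the last hash.
            level = level + [level[-1]]
        levels.append(level)
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    levels.append(level)
    return levels


def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(root: bytes, record: bytes, proof: list[tuple[bytes, bool]]) -> bool:
    """Recompute the root from one record and its proof, then compare."""
    current = leaf_hash(record)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            current = node_hash(sibling, current)
        else:
            current = node_hash(current, sibling)
    return current == root
```

A verifier who holds only the root can check that a record belongs to the committed dataset using a proof whose length is logarithmic in the number of records, without downloading the rest of the data.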
Abstract
Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models by enabling model publishers to sign their models and prove properties about the datasets they use.
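As a rough illustration of what a model signature has to cover, the sketch below hashes every file in a model directory and folds the per-file hashes into a single digest that a publisher could then sign. It is a sketch under stated assumptions, not the authors' implementation and not the Sigstore API: the names `hash_file`, `model_manifest`, and `manifest_digest`, the chunk size, and the manifest encoding are all hypothetical, and the signing step itself is omitted.

```python
import hashlib
import tempfile
from pathlib import Path

# Assumed chunk size: read 1 MiB at a time so large weight files
# are never loaded into memory in full.
CHUNK_SIZE = 1 << 20


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_manifest(model_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    files = sorted(p for p in model_dir.rglob("*") if p.is_file())
    return {p.relative_to(model_dir).as_posix(): hash_file(p) for p in files}


def manifest_digest(manifest: dict[str, str]) -> str:
    """Fold the manifest into one digest; this is the value a publisher would sign."""
    digest = hashlib.sha256()
    for name, file_hash in sorted(manifest.items()):
        # Hypothetical encoding: NUL separates name from hash, newline ends the entry.
        digest.update(f"{name}\0{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp)
        (model_dir / "weights.bin").write_bytes(b"\x00" * 1024)
        (model_dir / "config.json").write_text('{"layers": 2}')
        before = manifest_digest(model_manifest(model_dir))

        # Changing a single byte of the weights changes the digest,
        # so a signature over the original digest no longer verifies.
        (model_dir / "weights.bin").write_bytes(b"\x01" + b"\x00" * 1023)
        after = manifest_digest(model_manifest(model_dir))
        print(before != after)
```

Hashing files independently, as done here, also means the per-file work could be parallelized, which matters when the hashing cost for large models dominates.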
