Machine Learning Models Have a Supply Chain Problem

Sarah Meiklejohn, Hayden Blauzvern, Mihai Maruseac, Spencer Schrock, Laurent Simon, Ilia Shumailov

TL;DR

The paper identifies substantial supply-chain risks in the open ML model ecosystem, including tampering and data provenance concerns, and argues for verifiable transparency. It presents two practical defenses: model signing to bind published models to their publishers and a verifiable dataset framework that uses VRFs, Merkle-tree accumulators, and zero-knowledge sets to prove data inclusion without revealing sensitive data. A Python implementation demonstrates the feasibility of model signing and analyzes hashing costs for large models, while the dataset-verifiability design is evaluated across large data scales to assess performance and scalability. Together, these steps aim to enhance trust for users and regulators by enabling verifiable integrity and provenance in open ML model hubs, though the authors acknowledge that much work remains to realize broad adoption and robust guarantees.

Abstract

Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models, in terms of enabling model publishers to sign their models and prove properties about the datasets they use.
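As a rough illustration of what a model signature could bind to, the sketch below hashes every file in a model directory and folds the per-file digests into a single digest that a publisher could then sign. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`hash_file`, `model_manifest`, `manifest_digest`), the 1 MiB chunk size, and the manifest encoding are all hypothetical, and the signing step itself (for example via Sigstore) is left out.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an assumed value, not taken from the paper


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h.update(chunk)
    return h.hexdigest()


def model_manifest(model_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    return {
        p.relative_to(model_dir).as_posix(): hash_file(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file()
    }


def manifest_digest(manifest: dict[str, str]) -> str:
    """Fold the manifest into one digest; this is the value one would sign."""
    h = hashlib.sha256()
    for name in sorted(manifest):
        # Hypothetical encoding: "<path>\0<digest>\n" per entry, in sorted order.
        h.update(f"{name}\0{manifest[name]}\n".encode())
    return h.hexdigest()
```

Because the digest covers both file names and file contents, changing, adding, or removing any file in the directory yields a different value, so a signature over the old digest no longer verifies.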

Paper Structure

This paper contains 23 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: The different actors and phases in the software supply chain, along with the points at which malicious actions can be taken. This image was taken from https://slsa.dev/spec/v1.0/threats-overview.
  • Figure 2: Averaged over five runs and plotted on a log-log scale, the time, in seconds, required to hash a file of a size ranging from 1 to 1T, on three different machines and using our two different approaches. The green dots represent the time to hash different large open models using the list-based approach on M3, as summarized in Table 1.
  • Figure 3: Algorithms for our zero-knowledge set, assuming an underlying accumulator $\mathsf{Acc}$ and VRF $\mathsf{VRF}$.
  • Figure 4: Averaged over ten runs and plotted on a log-log scale, the time, in seconds, to commit to and prove and verify inclusion in a data registry of a given size, ranging from 1000 to 100 million entries.
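
The commit, prove, and verify operations named in the Figure 4 caption can be illustrated with a plain Merkle tree. This is a minimal sketch under stated assumptions, not the paper's construction: the paper's zero-knowledge set additionally relies on a VRF to hide entries, which is omitted here, and the function names, the domain-separation prefixes, and the duplicate-last-node padding rule are all hypothetical simplifications.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(entry: bytes) -> bytes:
    # Prefix 0x00 separates leaf hashes from interior-node hashes.
    return _h(b"\x00" + entry)


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node.
    return _h(b"\x01" + left + right)


def build_levels(entries: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first and the single root last."""
    if not entries:
        raise ValueError("cannot commit to an empty registry")
    level = [leaf_hash(e) for e in entries]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Simplifying assumption: pad odd levels by repeating the last node.
            level = level + [level[-1]]
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def commit(entries: list[bytes]) -> bytes:
    """The commitment is the Merkle root."""
    return build_levels(entries)[-1][0]


def prove(entries: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Return (sibling, node_is_right_child) pairs from leaf to root."""
    proof = []
    for level in build_levels(entries)[:-1]:
        sib = index ^ 1
        sibling = level[sib] if sib < len(level) else level[index]
        proof.append((sibling, index % 2 == 1))
        index //= 2
    return proof


def verify(root: bytes, entry: bytes, proof: list[tuple[bytes, bool]]) -> bool:
    """Recompute the root from the entry and its proof, then compare."""
    node = leaf_hash(entry)
    for sibling, node_is_right in proof:
        node = node_hash(sibling, node) if node_is_right else node_hash(node, sibling)
    return node == root
```

A proof holds one sibling hash per tree level, so its size and verification cost grow logarithmically with the number of registry entries. Unlike a zero-knowledge set, this plain tree reveals sibling hashes and the entry's position.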