Atlas: A Framework for ML Lifecycle Provenance & Transparency
Marcin Spoczynski, Marcela S. Melara, Sebastian Szyller
TL;DR
Atlas addresses the risk of insecure ML lifecycles by providing end-to-end provenance and tamper detection through a framework that combines hardware-backed trusted execution environments (TEEs), cryptographic attestations, and open provenance specifications. It introduces a two-component architecture, a transparency service and a verification service, tied together by attestation clients and an append-only transparency log that uses Merkle trees to capture artifact measurements and transformation attestations. The authors implement a prototype using Intel TDX, C2PA manifests, and Kubeflow/PyTorch, and validate it with a BERT fine-tuning case study, reporting training overhead under 8% and significant verification efficiency gains from caching and batching. This work advances ML supply-chain integrity by enabling verifiable, auditable provenance across data, training, evaluation, and deployment, with practical implications for compliance and trust in AI deployments.
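To make the idea of an append-only log that uses Merkle trees over artifact measurements concrete, the following is a minimal illustrative sketch, not Atlas's actual implementation. The class name TransparencyLog, the helper functions, the choice of SHA-256 as the measurement, and the pairwise tree construction (with an unpaired node promoted unchanged) are all assumptions made for illustration. The leaf and node prefixes follow the common domain-separation convention, but the tree shape is simplified and is not claimed to match any particular log specification.

```python
import hashlib


def leaf_hash(measurement: bytes) -> bytes:
    # Prefix 0x00 marks a leaf so leaf and interior hashes cannot collide.
    return hashlib.sha256(b"\x00" + measurement).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root over already-hashed leaves (pairwise; an odd node is promoted)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(node_hash(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0]


class TransparencyLog:
    """Hypothetical in-memory log: entries can only be appended."""

    def __init__(self) -> None:
        self._leaves: list[bytes] = []

    def append(self, artifact: bytes) -> bytes:
        # Assumption: the "measurement" is the SHA-256 digest of the artifact.
        measurement = hashlib.sha256(artifact).digest()
        self._leaves.append(leaf_hash(measurement))
        return self.root()

    def root(self) -> bytes:
        return merkle_root(self._leaves)


# Usage: log two artifacts, then check that a modified artifact changes the root.
log = TransparencyLog()
log.append(b"dataset-v1")
recorded_root = log.append(b"model-weights-v1")

tampered = TransparencyLog()
tampered.append(b"dataset-v1-poisoned")
tampered.append(b"model-weights-v1")
assert tampered.root() != recorded_root
```

Because the root commits to every logged measurement, a verifier that holds only the recorded root can detect any later substitution of an artifact by recomputing the root, which is the tamper-detection property the summary describes.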
Abstract
The rapid adoption of open-source machine learning (ML) datasets and models exposes today's AI applications to critical risks, such as data poisoning and supply chain attacks, across the ML lifecycle. With growing regulatory pressure to address these issues through greater transparency, ML model vendors face the challenge of balancing these requirements against confidentiality needs for data and intellectual property. We propose Atlas, a framework that enables fully attestable ML pipelines. Atlas leverages open specifications for data and software supply chain provenance to collect verifiable records of model artifact authenticity and end-to-end lineage metadata. Atlas combines trusted hardware and transparency logs to enhance metadata integrity, preserve data confidentiality, and limit unauthorized access during ML pipeline operations, from training through deployment. Our prototype implementation of Atlas integrates several open-source tools to build an ML lifecycle transparency system, and we assess the practicality of Atlas through two case study ML pipelines.
