Mitigating Downstream Model Risks via Model Provenance

Keyu Wang, Abdullah Norozi Iranzad, Scott Schaffter, Meg Risdal, Doina Precup, Jonathan Lebensold

TL;DR

This work addresses the lack of transparent provenance for foundation models and their upstream dependencies, a gap that lets risks propagate to downstream systems. It proposes a machine-readable model specification and a unified model record (UMR) repository that trace upstream–downstream relationships and automate publication to multiple formats. Through a healthcare case study, it demonstrates how compromised upstream assets (e.g., PLIP and LAION-5B) can impact downstream models and highlights the resulting regulatory and ethical risks. The UMR aims to serve as an index and alert system integrated with platforms like HuggingFace and Kaggle, relying on community-driven standardization and semver-based provenance to support responsible innovation and mitigate risks in AI ecosystems.

Abstract

Research and industry are rapidly advancing the innovation and adoption of foundation model-based systems, yet the tools for managing these models have not kept pace. Understanding the provenance and lineage of models is critical for researchers, industry, regulators, and public trust. While model cards and system cards were designed to provide transparency, they fall short in key areas: tracing model genealogy, enabling machine readability, offering reliable centralized management systems, and fostering consistent creation incentives. This challenge mirrors issues in software supply chain security, but AI/ML remains at an earlier stage of maturity. Addressing these gaps requires industry-standard tooling that can be adopted by foundation model publishers, open-source model innovators, and major distribution platforms. We propose a machine-readable model specification format to simplify the creation of model records, thereby reducing error-prone human effort, notably when a new model inherits most of its design from a foundation model. Our solution explicitly traces relationships between upstream and downstream models, enhancing transparency and traceability across the model lifecycle. To facilitate adoption, we introduce the unified model record (UMR) repository, a semantically versioned system that automates the publication of model records to multiple formats (PDF, HTML, LaTeX) and provides a hosted web interface (https://modelrecord.com/). This proof of concept aims to set a new standard for managing foundation models, bridging the gap between innovation and responsible model management.
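
The idea of tracing upstream and downstream relationships can be illustrated with a minimal Python sketch. It is not the UMR specification format: the `ModelRecord` fields, the `affected_downstream` helper, and every record name and version below are hypothetical, chosen only to show how explicit upstream links would let a tool list the models that depend on a compromised asset.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical record: a name, a semver string, and upstream names."""
    name: str
    version: str
    upstream: Tuple[str, ...] = ()


def affected_downstream(records: Iterable[ModelRecord], compromised: str) -> List[str]:
    """Return every record that depends, directly or transitively,
    on the asset named `compromised`, sorted by name."""
    # Invert the upstream links so each asset maps to its direct dependents.
    dependents: Dict[str, List[str]] = {}
    for record in records:
        for parent in record.upstream:
            dependents.setdefault(parent, []).append(record.name)

    # Depth-first walk over the inverted links.
    seen = set()
    stack = [compromised]
    while stack:
        node = stack.pop()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen)


# Illustrative records only; these are not entries from the UMR repository.
RECORDS = [
    ModelRecord("example-dataset", "1.0.0"),
    ModelRecord("example-foundation-model", "2.1.0", ("example-dataset",)),
    ModelRecord("example-clinical-model", "0.3.0", ("example-foundation-model",)),
    ModelRecord("example-unrelated-model", "1.2.0"),
]

print(affected_downstream(RECORDS, "example-dataset"))
```

Storing only the upstream names in each record keeps a record cheap to write by hand; the downstream view is derived on demand by inverting those links, which is the direction an alert about a compromised asset would need.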

Paper Structure (21 sections, 3 figures)

Figures (3)

  • Figure 1: The provenance graph for the Llama-VID Short Video, a video-to-text captioning model built using a number of open-source foundation models.
  • Figure 2: Model provenance graph of PLIP with its upstream dependencies.
  • Figure 3: LLaVA-1.6 Vicuna 13B in different formats including HTML, PDF, and GraphViz.