Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments

Grigori Fursin

TL;DR

The paper addresses the challenge of designing and running AI/ML workloads efficiently across heterogeneous models, data, software, and hardware while protecting intellectual property. It proposes a virtualization-centric framework, Collective Mind (CM), with CM4MLOps, MLPerf-driven benchmarks, and the Collective Knowledge Playground to automate, standardize, and reproduce experiments via portable automation recipes (CM scripts) and artifact management. Key contributions include donating CM and CK to MLCommons, modularizing MLPerf benchmarks through CM4MLOps (cm4mlperf) and hosting the first large open challenge that yielded 12,217 MLPerf v3.1 inference results from 20+ companies, demonstrating scalable community collaboration and reproducibility. The approach aims to democratize AI research, reduce costs, and enable automatic software/hardware co-design across vendors, with future plans (e.g., CMX, CMX4MLOps) to further automate co-design and maintenance of portable automations across evolving platforms.

Abstract

This white paper introduces my educational community initiative to learn how to run AI, ML and other emerging workloads in the most efficient and cost-effective way across diverse models, data sets, software and hardware. This project leverages Collective Mind (CM), virtualized MLOps and DevOps (CM4MLOps), MLPerf benchmarks, and the Collective Knowledge playground (CK), which I have developed in collaboration with the community and MLCommons. I created Collective Mind as a small and portable Python package with minimal dependencies, a unified CLI and Python API to help researchers and engineers automate repetitive, tedious, and time-consuming tasks. I also designed CM as a distributed framework, continuously enhanced by the community through the CM4* repositories, which function as the unified interface for organizing and managing various collections of automations and artifacts. For example, the CM4MLOps repository includes many automations, also known as CM scripts, to streamline the process of building, running, benchmarking, and optimizing AI, ML, and other workflows across ever-evolving models, data, and systems. I donated CK, CM and CM4MLOps to MLCommons to foster collaboration between academia and industry to learn how to co-design more efficient and cost-effective AI systems while capturing and encoding knowledge within Collective Mind, protecting intellectual property, enabling portable skills, and accelerating the transition of state-of-the-art research into production. My ultimate goal is to collaborate with the community to complete my two-decade journey toward creating self-optimizing software and hardware that can automatically learn how to run any workload in the most efficient and cost-effective manner based on user requirements and constraints such as cost, latency, throughput, accuracy, power consumption, size, and other critical factors.
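
The abstract describes automations that are reached through one unified interface and reused across projects. The idea of a tag-addressed, portable automation recipe can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the names `recipe`, `access`, `detect_os` and `get_batch_size`, the tag strings, and the dictionary-in, dictionary-out convention with a `return` code are assumptions made for illustration and are not the actual CM API.

```python
import platform
from typing import Callable, Dict, FrozenSet

# Hypothetical registry: a recipe is looked up by its tags, not by a file path.
_RECIPES: Dict[FrozenSet[str], Callable[[dict], dict]] = {}


def recipe(*tags: str):
    """Register a function as an automation recipe under the given tags."""
    def register(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        _RECIPES[frozenset(tags)] = func
        return func
    return register


@recipe("detect", "os")
def detect_os(inputs: dict) -> dict:
    # Query the host platform instead of hard-coding OS-specific commands.
    return {"return": 0, "os": platform.system().lower()}


@recipe("get", "batch-size")
def get_batch_size(inputs: dict) -> dict:
    # Illustrative default only; a real recipe would probe the target system.
    return {"return": 0, "batch_size": int(inputs.get("batch_size", 1))}


def access(request: dict) -> dict:
    """Single entry point: dispatch a dictionary request to the matching recipe."""
    wanted = frozenset(
        t.strip() for t in request.get("tags", "").split(",") if t.strip()
    )
    matches = [f for tags, f in _RECIPES.items() if wanted and wanted <= tags]
    if len(matches) != 1:
        message = f"expected one recipe for tags {sorted(wanted)}, found {len(matches)}"
        return {"return": 1, "error": message}
    return matches[0](request)
```

In this sketch a caller would write `access({"tags": "detect,os"})` and check the `return` field before using the other keys. Addressing recipes by tags rather than by script paths is the design choice that lets the same request be reused when the underlying implementation changes.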

Paper Structure (5 sections, 5 figures)

This paper contains 5 sections, 5 figures.

Figures (5)

  • Figure 1: The decentralized architecture of Collective Mind.
  • Figure 2: Many research projects implement the same "research actions" using diverse ad-hoc OS commands and Python scripts, sharing them along with loosely organized Git repositories, containers, and Jupyter notebooks to enable the community to reuse their projects and reproduce experiments (doi:10.1098/rsta.2020.0211; acm_techtalk_fursin_reproducibility_2022). This figure includes the MAD landscape (Machine Learning, AI, and Data) sourced from https://mad.firstmark.com.
  • Figure 3: The basic workflow of a CM script.
  • Figure 4: The Collective Mind framework enables the creation of portable, technology-agnostic automation recipes (CM scripts). These scripts can be seamlessly reused across diverse research projects, dynamically adapting to varying operating systems, models, datasets, software, and hardware.
  • Figure 5: Collective Knowledge Playground: an educational community project to learn how to co-design more efficient and cost-effective software and hardware for AI, ML and other emerging workloads with the help of Collective Mind, virtualized MLOps, MLPerf and reproducible optimization challenges.