Accelerated Cloud for Artificial Intelligence (ACAI)

Dachi Chen, Weitian Ding, Chen Liang, Chang Xu, Junwei Zhang, Majd Sakr

TL;DR

ACAI targets the repetitive, data- and resource-intensive parts of ML workflows with an end-to-end cloud platform that combines a versioned data lake with an execution engine offering automatic resource provisioning and provenance tracking. A log-linear runtime predictor guides a grid-search-based auto-provisioner that optimizes for either runtime or cost; on MNIST this yields a 1.7x speedup and a 39% cost reduction, and it makes hyperparameter tuning faster and cheaper. The platform integrates a MongoDB-backed metadata layer and a Neo4j provenance graph, which make experiments reproducible and traceable. A usability study shows reduced setup time and cost relative to manual GCP workflows. ACAI is deployed at CMU and built on microservices; it lays the groundwork for scalable, reproducible ML pipelines, with future extensions to data lake access control, caching, and distributed training frameworks.

Abstract

Training an effective machine learning (ML) model is an iterative process that requires effort in multiple dimensions. Vertically, a single pipeline typically includes an initial ETL (Extract, Transform, Load) of raw datasets, a model training stage, and an evaluation stage in which practitioners obtain statistics on model performance. Horizontally, many such pipelines may be required to find the best model within a search space of model configurations. Many practitioners resort to maintaining logs manually and writing simple glue code to automate the workflow. However, carrying out this process on the cloud is not trivial: it requires resource provisioning, data management, and bookkeeping of job histories to ensure that results are reproducible. We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI), to help improve the productivity of ML practitioners. ACAI achieves this goal by enabling cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking. Specifically, ACAI provides practitioners with (1) a data lake for storing versioned datasets and their corresponding metadata, and (2) an execution engine for executing ML jobs on the cloud with automatic resource provisioning (auto-provision), logging, and provenance tracking. To evaluate ACAI, we test the efficacy of our auto-provisioner on the MNIST handwritten digit classification task, and we study the usability of our system using experiments and interviews. We show that our auto-provisioner produces a 1.7x speed-up and a 39% cost reduction, and that our system reduces experiment time for ML scientists by 20% on typical ML use cases.
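To make the auto-provisioning idea concrete, the following is a minimal sketch of how a log-linear runtime predictor could drive a grid search over resource configurations. It is not ACAI's implementation: the single feature (CPU count), the function names `fit_log_linear`, `predict_runtime`, and `grid_search`, the price constant, and the synthetic profiling data are all assumptions made for illustration.

```python
import math

def fit_log_linear(cpus, runtimes):
    """Least-squares fit of log(runtime) = a + b * log(cpus). Hypothetical model form."""
    xs = [math.log(c) for c in cpus]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_runtime(a, b, cpus):
    """Predicted runtime in hours for a given CPU count."""
    return math.exp(a + b * math.log(cpus))

def grid_search(a, b, candidate_cpus, price_per_cpu_hour, objective):
    """Return (cpus, runtime_hours, cost) minimizing predicted runtime or predicted cost."""
    best = None
    for c in candidate_cpus:
        runtime_h = predict_runtime(a, b, c)
        cost = runtime_h * c * price_per_cpu_hour
        score = runtime_h if objective == "runtime" else cost
        if best is None or score < best[0]:
            best = (score, c, runtime_h, cost)
    return best[1], best[2], best[3]

# Synthetic profiling data (assumption): runtime follows 8 * cpus**-0.8 hours exactly.
profiled_cpus = [1, 2, 4, 8]
profiled_runtimes = [8.0 * c ** -0.8 for c in profiled_cpus]

a, b = fit_log_linear(profiled_cpus, profiled_runtimes)

candidates = [1, 2, 4, 8, 16]
PRICE_PER_CPU_HOUR = 0.05  # hypothetical price, not a real cloud rate

fastest = grid_search(a, b, candidates, PRICE_PER_CPU_HOUR, "runtime")
cheapest = grid_search(a, b, candidates, PRICE_PER_CPU_HOUR, "cost")
print("fastest config (cpus, hours, cost):", fastest)
print("cheapest config (cpus, hours, cost):", cheapest)
```

On this toy data the scaling exponent is sub-linear, so the runtime objective selects the largest candidate and the cost objective selects the smallest. The same loop would apply with more features (for example memory) by extending the fit to multiple regression.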

Paper Structure (50 sections, 5 equations, 16 figures, 9 tables)

Figures (16)

  • Figure 1: ACAI System Abstractions
  • Figure 2: Relationship between two nodes
  • Figure 3: Job Life Cycle
  • Figure 4: Dashboard job history page
  • Figure 5: Dashboard provenance page
  • ...and 11 more figures