Automating Code Adaptation for MLOps -- A Benchmarking Study on LLMs

Harsh Patel, Buvaneswari A. Ramanan, Manzoor A. Khan, Thomas Williams, Brian Friedman, Lawrence Drabeck

TL;DR

The paper investigates automating MLOps code adaptation with LLMs. It introduces two core tasks, Inlining and Translation, and benchmarks gpt-3.5-turbo against WizardCoder. It combines prompt tuning, DocPrompting, and retrieval-based documentation strategies to improve API comprehension and cross-component translation. Results show that gpt-3.5-turbo substantially outperforms WizardCoder on key MLOps functionalities (model optimization, experiment tracking, model registration, and hyperparameter optimization) and that documentation-assisted translation approaches can help bridge API gaps between components such as GitPython and DVC. These findings suggest that LLM-driven automation can reduce the manual effort of integrating and migrating MLOps capabilities across pipelines, with practical impact for enterprises and system integrators.

Abstract

This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionalities into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on automatically accomplishing various MLOps functionalities in different settings. We perform a benchmarking study that assesses the ability of these models to: (1) adapt existing code samples (Inlining) with component-specific MLOps functionality, such as MLflow and Weights & Biases for experiment tracking or Optuna for hyperparameter optimization, and (2) perform Translation from one component of an MLOps functionality to another, e.g., translating existing version control code based on the GitPython library into code based on the Data Version Control (DVC) library. We also propose three different approaches that teach LLMs to use the API documentation of the components as a reference while accomplishing the Translation tasks. In our evaluations, the gpt-3.5-turbo model significantly outperforms WizardCoder in Pass@3 accuracy on model optimization (55% compared to 0% for WizardCoder), experiment tracking (100% compared to 62.5%), model registration (92% compared to 42%), and hyperparameter optimization (83% compared to 58%). These figures are averages in each model's best possible setting, and they showcase the superior code adaptability of gpt-3.5-turbo in complex MLOps tasks.
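
The abstract reports Pass@3 accuracy but does not spell out how it is computed. The sketch below shows one common way to compute pass@k, the unbiased estimator widely used in code-generation benchmarks; whether the paper uses exactly this estimator is an assumption. The names `pass_at_k` and `mean_pass_at_k` and the sample results are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generations of which c are correct, passes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect generations: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[list[bool]], k: int) -> float:
    """Average pass@k over tasks; each inner list holds the pass/fail
    outcome of every generation attempted for one task."""
    scores = [pass_at_k(len(r), sum(r), k) for r in results]
    return sum(scores) / len(scores)


# Hypothetical outcomes for two tasks with three generations each
# (illustrative only, not data from the paper).
example_results = [
    [False, True, False],   # one of three attempts passed
    [False, False, False],  # no attempt passed
]
example_score = mean_pass_at_k(example_results, k=3)
```

With exactly three generations per task, pass@3 reduces to "did any of the three attempts pass", so the first hypothetical task scores 1.0, the second scores 0.0, and the mean is 0.5.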

Paper Structure

This paper contains 41 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: The iterative process of ML development
  • Figure 2: An end-to-end Machine Learning Operations (MLOps) System.
  • Figure 3: Illustration of the Prompt Tuning Process for Model Optimization (Example 1) and Experiment Tracking (Example 2) Tasks. The $+$ sign shows the iterative process of adapting our prompt to identify the most effective prompt for accomplishing the intended MLOps task.
  • Figure 4: Translation Task - Data Curation Pipeline
  • Figure 8: Code Inlining Task - The highlighted green sections demonstrate the expected inline adaptations when a simple model training script is provided as an input to an LLM.
  • ...and 5 more figures