"We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning

Shreya Shankar; Rolando Garcia; Joseph M Hellerstein; Aditya G Parameswaran

TL;DR

The paper investigates how ML engineers (MLEs) operationalize production ML (MLOps) through semi-structured interviews with 18 MLEs across industries. It identifies a four-stage, human-centered workflow (data preparation, experimentation, evaluation and deployment, and monitoring and response) and introduces the 3Vs (velocity, visibility, and versioning) as core design criteria for successful deployments. The study highlights data-centric experimentation, collaborative feature engineering, multi-stage deployment, on-call monitoring, and dynamic validation datasets as key practices, and notes challenges such as data quality management, ground-truth delays, pipeline jungles, and heavy-tailed production bugs. It concludes with design implications and concrete tooling opportunities to support MLOps at each stage, with the aim of improving the reliability, speed, and business impact of ML in production.

Abstract

Organizations rely on machine learning engineers (MLEs) to deploy models and maintain ML pipelines in production. Due to models' extensive reliance on fresh data, the operationalization of machine learning, or MLOps, requires MLEs to have proficiency in data science and engineering. When considered holistically, the job seems staggering -- how do MLEs do MLOps, and what are their unaddressed challenges? To address these questions, we conducted semi-structured ethnographic interviews with 18 MLEs working on various applications, including chatbots, autonomous vehicles, and finance. We find that MLEs engage in a workflow of (i) data preparation, (ii) experimentation, (iii) evaluation throughout a multi-staged deployment, and (iv) continual monitoring and response. Throughout this workflow, MLEs collaborate extensively with data scientists, product stakeholders, and one another, supplementing routine verbal exchanges with communication tools ranging from Slack to organization-wide ticketing and reporting systems. We introduce the 3Vs of MLOps: velocity, visibility, and versioning -- three virtues of successful ML deployments that MLEs learn to balance and grow as they mature. Finally, we discuss design implications and opportunities for future work.

Paper Structure (44 sections, 3 figures, 3 tables)

Introduction
Related Work
Characterizing the ML Engineer
Machine Learning Workflows
Production ML Challenges
Software Engineering for ML
MLOps Practices and Challenges
Methods
Participant Recruitment & Selection
Initial Recruitment: Relying on Professional Networks
Course Correction: Diversifying the Sample
Building a Representative Sample: Iterative Refinement
Interview Protocol
Transcript Coding & Analysis
Summary of Findings
...and 29 more sections

Figures (3)

Figure 1: Core tasks in the MLOps workflow. Prior work discusses a production data science workflow of preparation, modeling, and deployment [autoai]. Our work exposes (i) the scheduled and recurring nature of data preparation (including automated ML tasks, such as model retraining), identifies (ii) a broader experimentation step (which could include modeling or adding new features), and provides more insight into human-centered (iii) evaluation & deployment, and (iv) monitoring & response.
Figure 2: Visual summary of coded transcripts. The x-axis of (a), the color-coded overview, corresponds to a segment (or group) of transcript lines, and the number in each cell is the code's frequency for that transcript segment and for that participant. Segments are blank after the conclusion of each interview, because interviews varied in duration. Each color in (a) is associated with a top-level axial code from our interview study and is presented in the color legend (b). The color legend also shows the frequency of each code across all interviews.
Figure 3: Abridged code system: A distilled representation of the evolved code system resulting from our qualitative study, capturing the primary tasks, organizational aspects, operational methodologies, challenges, and tools utilized by Machine Learning Engineers.