"We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning
Shreya Shankar, Rolando Garcia, Joseph M Hellerstein, Aditya G Parameswaran
TL;DR
The paper investigates how ML engineers (MLEs) operationalize machine learning in production (MLOps), drawing on semi-structured interviews with 18 MLEs across industries. It identifies a four-stage, human-centered workflow (data preparation, experimentation, evaluation and deployment, and monitoring and response) and introduces the 3Vs (velocity, visibility, and versioning) as core design criteria for successful deployments. The study highlights data-centric experimentation, collaborative feature engineering, multi-stage deployment, on-call monitoring, and dynamic validation datasets as key practices. It also notes challenges such as data quality management, ground-truth delays, pipeline jungles, and heavy-tailed production bugs. It concludes with design implications and concrete tooling opportunities for each stage of the workflow, aimed at improving the reliability, speed, and business impact of ML in production.
Abstract
Organizations rely on machine learning engineers (MLEs) to deploy models and maintain ML pipelines in production. Due to models' extensive reliance on fresh data, the operationalization of machine learning, or MLOps, requires MLEs to have proficiency in data science and engineering. When considered holistically, the job seems staggering -- how do MLEs do MLOps, and what are their unaddressed challenges? To address these questions, we conducted semi-structured ethnographic interviews with 18 MLEs working on various applications, including chatbots, autonomous vehicles, and finance. We find that MLEs engage in a workflow of (i) data preparation, (ii) experimentation, (iii) evaluation throughout a multi-staged deployment, and (iv) continual monitoring and response. Throughout this workflow, MLEs collaborate extensively with data scientists, product stakeholders, and one another, supplementing routine verbal exchanges with communication tools ranging from Slack to organization-wide ticketing and reporting systems. We introduce the 3Vs of MLOps: velocity, visibility, and versioning -- three virtues of successful ML deployments that MLEs learn to balance and grow as they mature. Finally, we discuss design implications and opportunities for future work.
