Reinforcement Learning for Machine Learning Model Deployment: Evaluating Multi-Armed Bandits in ML Ops Environments

S. Aaron McClendon, Vishaal Venkatesh, Juan Morinelli

TL;DR

The paper addresses robust model deployment in ML Ops under distribution shift by evaluating reinforcement-learning-based strategies, specifically multi-armed bandits, against naïve, validation-based, and A/B testing approaches. It implements and compares epsilon-greedy, UCB, and Thompson Sampling in a dynamic, chunked simulation over Census wage and fraud datasets, using tailored reward functions and metrics. Key findings show RL methods can match or exceed traditional baselines, with epsilon-greedy often delivering strong overall performance and helping automate adaptation to drift, particularly in imbalanced domains. The work highlights practical implications for automating real-time deployment decisions and outlines future directions for scaling, regression tasks, and drift-aware reward shaping to further reduce manual monitoring in production systems.

Abstract

In modern ML Ops environments, model deployment is a critical process that traditionally relies on static heuristics such as validation error comparisons and A/B testing. However, these methods require human intervention to adapt to real-world deployment challenges, such as model drift or unexpected performance degradation. We investigate whether reinforcement learning (RL), specifically multi-armed bandit (MAB) algorithms, can manage model deployment decisions dynamically and more effectively. Our approach enables more adaptive production environments by continuously evaluating deployed models and rolling back underperforming ones in real time. We test six model selection strategies across two real-world datasets and find that RL-based approaches match or exceed traditional methods in performance. Our findings suggest that RL-based model management can improve automation, reduce reliance on manual interventions, and mitigate risks associated with post-deployment model failures.
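To make the bandit framing concrete, the following is a minimal sketch of epsilon-greedy selection among candidate models, one decision per batch. It is an illustration under stated assumptions, not the paper's implementation: the function name `epsilon_greedy_deploy`, the `batch_rewards` layout, and the default `epsilon` are hypothetical, and the paper's tailored reward functions are not reproduced here.

```python
import random


def epsilon_greedy_deploy(batch_rewards, n_models, epsilon=0.1, seed=0):
    """Choose one model (arm) per batch with epsilon-greedy.

    batch_rewards[t][m] is the reward model m would earn on batch t
    (hypothetical layout). Only the chosen model's reward is observed.
    """
    rng = random.Random(seed)
    counts = [0] * n_models    # number of times each model was deployed
    means = [0.0] * n_models   # running mean reward per model
    chosen = []
    for rewards in batch_rewards:
        if rng.random() < epsilon:
            # Explore: deploy a uniformly random model.
            arm = rng.randrange(n_models)
        else:
            # Exploit: deploy the model with the best running mean
            # (ties resolve to the lowest index).
            arm = max(range(n_models), key=lambda m: means[m])
        reward = rewards[arm]
        counts[arm] += 1
        # Incremental update of the running mean for the chosen model.
        means[arm] += (reward - means[arm]) / counts[arm]
        chosen.append(arm)
    return chosen, means
```

With `epsilon=0` the sketch never explores, so it can stay on whichever model it tried first; a small positive `epsilon` lets it occasionally re-test the other models, which is what allows a bandit to react when a deployed model degrades.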

Paper Structure

This paper contains 53 sections, 15 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Model selection behavior across different values of $\epsilon$ in the Census dataset.
  • Figure 2: Balanced Classification scores across deployment chunks for different $\epsilon$ values in the Census dataset.
  • Figure 3: Dominant model selection by chunk across all methods (PR-AUC). The visualization reveals distinct selection patterns: Validation-Based consistently selects model 0 (orange dashed line), Epsilon-Greedy consistently selects model 1 (red dotted line), while UCB (purple solid line) and Thompson Sampling (brown dashed line) progressively explore higher-numbered models in later chunks. Note that for the RL methods the plot shows the majority-selected model per chunk, because these methods could switch models dynamically within a chunk.
  • Figure 4: Thompson Sampling's batch-level model selection across all chunks, showing the dynamic switching pattern between models during the learning process. Each line represents a dataset chunk, and each point on a line is a batch at which the algorithm could decide to swap models.
  • Figure 5: Transition probabilities between selected models in Chunk 3 using Thompson Sampling. Rows represent the model selected at time t and columns represent the model selected at time t+1. Each cell shows the percentage of transitions observed between model pairs. For example, after selecting Model 2, the algorithm transitioned to Model 0 in 100% of cases, indicating strong instability in Model 2 selections. In contrast, Model 1 exhibited higher persistence, remaining selected in 40% of transitions.
  • ...and 1 more figure
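
The transition table described for Figure 5 can be computed from a sequence of batch-level selections. The sketch below is a hedged illustration: the function name `transition_percentages` and the example sequence are hypothetical and are not taken from the paper's data. Rows index the model selected at time t, columns the model selected at time t+1, and each row is normalized to percentages.

```python
def transition_percentages(selections, n_models):
    """Row-normalized transition table, in percent.

    selections is a list of model indices, one per batch (hypothetical input).
    table[i][j] is the share of transitions out of model i that went to model j.
    Rows for models with no outgoing transition are all zeros.
    """
    counts = [[0] * n_models for _ in range(n_models)]
    # Count each consecutive pair (model at t, model at t+1).
    for current, following in zip(selections, selections[1:]):
        counts[current][following] += 1
    table = []
    for row in counts:
        total = sum(row)
        table.append([100.0 * c / total if total else 0.0 for c in row])
    return table


# Made-up selection sequence, for illustration only.
example = [1, 1, 2, 0, 1, 0]
table = transition_percentages(example, n_models=3)
```

In this made-up sequence, every transition out of model 2 goes to model 0, so row 2 reads 100% in column 0, which is the same kind of pattern the Figure 5 caption describes for Model 2.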