Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing
Yu Zhu
TL;DR
This paper addresses adaptive configuration allocation in non-stationary pre-production testing. It introduces a Q-learning framework with hybrid reward shaping, which blends simulated and real feedback, and an online-offline training scheme that tracks abrupt shifts in failure probabilities. The method defines an MDP with state $S_t=[n_i(t),\hat{p}_i(t)]$, action $A_t=(i,j,\Delta)$, and a reward $R_t$ that combines simulated and observed signals; the agent learns an action-value function $Q(S_t,A_t)$ over these states and actions. In simulation, the RL approach outperforms static and optimization-based baselines and closely approaches an oracle that knows the true probabilities, achieving higher coverage $D_t$ and lower mean squared error ($\text{MSE}$) in estimating $p_i(t)$. These results support its effectiveness in dynamic, high-dimensional testing environments. The work contributes to adaptive testing and dynamic resource scheduling by providing a principled, scalable RL framework applicable to diverse heterogeneous systems.
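For orientation, the textbook tabular Q-learning update, written in the notation of this summary, is shown below. The step size $\alpha$ and discount factor $\gamma$ are generic symbols that this summary does not specify, and the blended reward on the second line is only one plausible reading of "hybrid reward" (a convex combination with an assumed weight $\lambda$), not necessarily the paper's exact definition.

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_t + \gamma \max_{a} Q(S_{t+1},a) - Q(S_t,A_t)\right]$$

$$R_t = \lambda\, R_t^{\text{sim}} + (1-\lambda)\, R_t^{\text{obs}}, \qquad \lambda \in [0,1] \quad \text{(assumed form)}$$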
Abstract
Ensuring reliability in modern software systems requires rigorous pre-production testing across highly heterogeneous and evolving environments. Because exhaustive evaluation is infeasible, practitioners must decide how to allocate limited testing resources across configurations where failure probabilities may drift over time. Existing combinatorial optimization approaches are static, ad hoc, and poorly suited to such non-stationary settings. We introduce a novel reinforcement learning (RL) framework that recasts configuration allocation as a sequential decision-making problem. Our method is the first to integrate Q-learning with a hybrid reward design that fuses simulated outcomes and real-time feedback, enabling both sample efficiency and robustness. In addition, we develop an adaptive online-offline training scheme that allows the agent to quickly track abrupt probability shifts while maintaining long-run stability. Extensive simulation studies demonstrate that our approach consistently outperforms static and optimization-based baselines, approaching oracle performance. This work establishes RL as a powerful new paradigm for adaptive configuration allocation, advancing beyond traditional methods and offering broad applicability to dynamic testing and resource scheduling domains.
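To make the idea concrete, here is a minimal tabular Q-learning sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. Specifically, it assumes that $n_i(t)$ is the number of tests allocated to configuration $i$, that an action $(i,j,\Delta)$ moves $\Delta$ tests from configuration $i$ to configuration $j$, that the hybrid reward is a convex blend with weight `lam`, and that the reward signal is the number of failures detected. For simplicity, the tabular state uses only the allocation vector and omits $\hat{p}_i(t)$. All function names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def hybrid_reward(simulated, observed, lam=0.5):
    """Blend simulated and observed signals (assumed convex combination)."""
    return lam * simulated + (1.0 - lam) * observed


def apply_action(n, action):
    """Move up to `delta` tests from configuration i to configuration j."""
    i, j, delta = action
    moved = min(delta, n[i])  # never move more tests than i currently holds
    out = list(n)
    out[i] -= moved
    out[j] += moved
    return tuple(out)


def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])


def run(true_p, budget=12, steps=500, eps=0.1, lam=0.5, seed=0):
    """Epsilon-greedy Q-learning on a toy allocation problem (hypothetical setup)."""
    rng = random.Random(seed)
    k = len(true_p)
    # A no-op action plus every single-test transfer between two configurations.
    actions = [(0, 0, 0)] + [(i, j, 1) for i in range(k) for j in range(k) if i != j]
    Q = defaultdict(float)
    n = tuple(budget // k for _ in range(k))  # start from an even split
    fails, trials = [0] * k, [0] * k

    for _ in range(steps):
        # Smoothed failure-rate estimates built from feedback seen so far.
        p_hat = [(f + 1) / (t + 2) for f, t in zip(fails, trials)]
        if rng.random() < eps:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(n, a)])
        n_next = apply_action(n, action)

        # Simulated signal: failures expected under the current estimates.
        simulated = sum(c * p for c, p in zip(n_next, p_hat))
        # Observed signal: failures actually found when the tests are run.
        observed = 0
        for i in range(k):
            for _ in range(n_next[i]):
                trials[i] += 1
                if rng.random() < true_p[i]:
                    fails[i] += 1
                    observed += 1

        reward = hybrid_reward(simulated, observed, lam)
        q_update(Q, n, action, reward, n_next, actions)
        n = n_next

    p_hat = [(f + 1) / (t + 2) for f, t in zip(fails, trials)]
    return n, p_hat, Q
```

A call such as `run([0.05, 0.2, 0.6], steps=200, seed=1)` returns the final allocation, the estimated failure probabilities, and the learned Q-table. Handling drifting probabilities, as the abstract describes, would additionally require the true rates to change over time and the estimates to discount old observations; this sketch keeps both fixed.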
