Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy
Chiara Mignacco, Matthieu Jonckheere, Gilles Stoltz
TL;DR
This work tackles online matching by learning to orchestrate a set of interpretable expert policies through advantage-based updates within an Adv2-inspired framework. It provides both expectation and high-probability regret guarantees and derives a novel finite-time bias bound for temporal-difference learning to enable reliable estimated advantages. A scalable neural actor-critic architecture implements the learned mixture over experts, ensuring real-time applicability in high-dimensional settings such as organ exchange networks. Empirical results on stochastic matching models demonstrate faster convergence and higher system efficiency than individual experts and standard RL baselines, highlighting the value of structured, adaptive learning for complex resource allocation tasks.
Abstract
Online matching problems arise in many complex systems, from cloud services and online marketplaces to organ exchange networks, where timely, principled decisions are critical for maintaining high system performance. Traditional heuristics in these settings are simple and interpretable but typically tailored to specific operating regimes, which can lead to inefficiencies when conditions change. We propose a reinforcement learning (RL) approach that learns to orchestrate a set of such expert policies, leveraging their complementary strengths in a data-driven, adaptive manner. Building on the Adv2 framework (Jonckheere et al., 2024), our method combines expert decisions through advantage-based weight updates and extends naturally to settings where only estimated value functions are available. We establish both expectation and high-probability regret guarantees and derive a novel finite-time bias bound for temporal-difference learning, enabling reliable advantage estimation even under constant step size and non-stationary dynamics. To support scalability, we introduce a neural actor-critic architecture that generalizes across large state spaces while preserving interpretability. Simulations on stochastic matching models, including an organ exchange scenario, show that the orchestrated policy converges faster and yields higher system-level efficiency than both individual experts and conventional RL baselines. Our results highlight how structured, adaptive learning can improve the modeling and management of complex resource allocation and decision-making processes.
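To make the idea of advantage-based weight updates concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes an exponential-weights style update over a fixed set of experts and a tabular TD(0) critic with constant step size. The function names (`update_expert_weights`, `td0_update`) and the parameters `eta`, `alpha`, and `gamma` are hypothetical choices made for this example.

```python
import numpy as np

def update_expert_weights(weights, advantages, eta=0.1):
    """One exponential-weights step (assumed form, not taken from the paper):
    experts with a higher estimated advantage gain probability mass.
    `eta` is a hypothetical learning rate."""
    logits = np.log(weights) + eta * np.asarray(advantages, dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()

def td0_update(values, state, reward, next_state, alpha=0.05, gamma=0.95):
    """Tabular TD(0) with a constant step size `alpha`.
    Updates `values` in place and returns the TD error."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error

# Three hypothetical experts, starting from a uniform mixture.
weights = np.full(3, 1.0 / 3.0)
advantages = [0.5, -0.2, 0.0]  # made-up advantage estimates for one state
weights = update_expert_weights(weights, advantages)

# One critic update on a two-state toy chain.
values = np.zeros(2)
td_error = td0_update(values, state=0, reward=1.0, next_state=1)
```

In this sketch the mixture shifts toward the expert with the largest estimated advantage while remaining a valid probability distribution. In practice the advantages would come from the learned critic rather than being fixed by hand.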
