An End-to-End Reinforcement Learning Based Approach for Micro-View Order-Dispatching in Ride-Hailing

Xinlang Yue; Yiran Liu; Fangzhou Shi; Sihong Luo; Chen Zhong; Min Lu; Zhe Xu

TL;DR

Micro-view order-dispatching (MICOD) in ride-hailing is a two-sided, dynamic matching problem that must be solved within short, batch-based windows. The authors propose an end-to-end reinforcement learning solution built on a two-layer Markov Decision Process (MDP) and the Deep Double Scalable Network (D2SN), which generates order-driver assignments directly and includes a Hold mechanism to defer suboptimal matches. On real Didi benchmarks and a calibrated simulator, D2SN consistently improves matching efficiency and driver income (TDI) while maintaining passenger experience (APD/CR), outperforming both one-shot combinatorial optimization (CO) methods and two-stage deep reinforcement learning (DRL) baselines. The paper also offers deployment guidance for large-scale, online ride-hailing systems.

Abstract

Assigning orders to drivers under localized spatiotemporal context (micro-view order-dispatching) is a major task at Didi, as it influences the ride-hailing service experience. Existing industrial solutions mainly follow a two-stage pattern that combines heuristic or learning-based algorithms with naive combinatorial methods to tackle the uncertainty of both sides' behaviors, including emerging timings, spatial relationships, and travel duration. In this paper, we propose a one-stage, end-to-end reinforcement learning based order-dispatching approach that solves behavior prediction and combinatorial optimization uniformly in a sequential decision-making manner. Specifically, we employ a two-layer Markov Decision Process framework to model this problem, and present the Deep Double Scalable Network (D2SN), an encoder-decoder network that generates order-driver assignments directly and stops assigning accordingly. In addition, by leveraging contextual dynamics, our approach can adapt to behavioral patterns for better performance. Extensive experiments on Didi's real-world benchmarks show that the proposed approach significantly outperforms competitive baselines in optimizing matching efficiency and user experience tasks. We also evaluate the deployment outline and discuss the gains and experience obtained during deployment tests from the perspective of large-scale engineering implementation.
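
The one-stage idea described in the abstract, where order-driver pairs are produced one sub-step at a time within a batch and assignment stops once the remaining matches look poor, can be sketched as a simple loop. This is an illustration only and not the authors' D2SN: the scoring function `score_pair` (negative Manhattan distance), the `hold_threshold` parameter, and all other names are assumptions introduced here, whereas the paper uses a learned encoder-decoder network to make these choices.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def score_pair(order: Point, driver: Point) -> float:
    """Hypothetical stand-in for a learned score: negative Manhattan pickup distance."""
    return -(abs(order[0] - driver[0]) + abs(order[1] - driver[1]))


def dispatch_batch(
    orders: Dict[str, Point],
    drivers: Dict[str, Point],
    hold_threshold: float,
) -> List[Tuple[str, str]]:
    """Pick order-driver pairs one sub-step at a time within a single batch.

    Each sub-step selects the best remaining pair. If even the best score is
    below `hold_threshold`, the loop stops and everything left is held for a
    later batch.
    """
    remaining_orders = dict(orders)
    remaining_drivers = dict(drivers)
    assignments: List[Tuple[str, str]] = []

    while remaining_orders and remaining_drivers:
        # Score every remaining pair and take the highest-scoring one.
        best_score, o_id, d_id = max(
            (score_pair(o_pos, d_pos), o_id, d_id)
            for o_id, o_pos in remaining_orders.items()
            for d_id, d_pos in remaining_drivers.items()
        )
        if best_score < hold_threshold:
            break  # "stop" decision: hold the remaining orders and drivers
        assignments.append((o_id, d_id))
        del remaining_orders[o_id]
        del remaining_drivers[d_id]

    return assignments


orders = {"o1": (0.0, 0.0), "o2": (5.0, 5.0)}
drivers = {"d1": (0.0, 1.0), "d2": (9.0, 9.0)}
result = dispatch_batch(orders, drivers, hold_threshold=-3.0)
print(result)
```

With the toy inputs above, the nearby pair is matched and the distant pair is held; loosening `hold_threshold` lets the second pair through as well. A fixed threshold is used here only to keep the sketch short; per the abstract, the stopping decision in D2SN is produced by the network itself.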

Paper Structure (27 sections, 7 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Methodology
MICOD Formulation
Two-layer MDP Framework
Outer-layer MDP
Inner-layer MDP
Deep Double Scalable Network
Encoder
Decoder
Cooperation
Deep Reinforcement Learning with D2SN
Actor-Critic Design
DRL Training
Experiments
...and 12 more sections

Figures (4)

Figure 1: Micro-view order-dispatching with the proposed end-to-end framework.
Figure 2: Order-dispatching process under the two-layer MDP framework. The upper part shows the outer-layer state transition; the lower part shows the inner-layer sub-state transition.
Figure 3: The architecture of D2SN. The state is updated after each sub-step $i$ of batch $t$ auto-regressively.
Figure 4: Comparison of methods with "hold" strategies. Results are averaged across APD and TDI task test samples, respectively.
