Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Zijian Zhao, Sen Li

TL;DR

This work tackles real-time, large-scale order dispatch in ride-sharing by formulating a centralized single-agent reinforcement learning (SARL) approach, Triple-BERT, that uses action decomposition and a BERT-based network to manage very large, dynamic action and observation spaces. A two-stage training pipeline addresses data scarcity and coordination challenges: Stage 1 is MARL-inspired pre-training (IDDQN) for robust feature extraction, and Stage 2 is centralized TD3 fine-tuning. The method introduces a QK-attention mechanism with a normalization strategy to reduce computational load and stabilize learning, and uses a policy-gradient-style update to optimize a decomposed joint-action policy. On a real-world Manhattan dataset, Triple-BERT outperforms state-of-the-art MARL baselines in served orders and pickup times while maintaining practical inference latency, which suggests it can scale to deployment on large ride-sharing platforms.

Abstract

On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers, each with distinct origins and destinations, to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized Single Agent Reinforcement Learning (SARL) method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a variant of TD3, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of drivers and orders. We validate our method using a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at the repository https://github.com/RS2002/Triple-BERT .
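
To make the action decomposition idea concrete, here is a minimal sketch in Python. It assumes the joint action probability factorizes as a product of per-driver probabilities, so the joint log-probability is a sum of per-driver log-probabilities. That product form, the function names, and the toy logits are illustrative assumptions and are not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one driver's candidate options."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_log_prob(per_driver_logits, chosen):
    """Log-probability of a joint action under an ASSUMED product factorization.

    per_driver_logits[i][j]: hypothetical score of driver i for option j.
    chosen[i]: index of the option selected for driver i.
    """
    total = 0.0
    for logits, j in zip(per_driver_logits, chosen):
        # Each driver contributes the log of its own action probability.
        total += math.log(softmax(logits)[j])
    return total


# Toy example: driver 0 has 2 options, driver 1 has 3 options, all equally likely.
logits = [[0.0, 0.0], [0.0, 0.0, 0.0]]
lp = joint_log_prob(logits, [1, 2])
print(math.exp(lp))  # joint probability = (1/2) * (1/3)
```

In this toy case the joint action space has 2 × 3 = 6 entries, while the decomposed policy only needs 2 + 3 = 5 numbers. With many drivers the gap grows quickly, which is the motivation for decomposing the joint probability rather than enumerating joint actions.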

Paper Structure

This paper contains 31 sections, 16 equations, 7 figures, 10 tables, 2 algorithms.

Figures (7)

  • Figure 1: Workflow: At each time step, the worker and order pools update their states based on the assignments made in the previous time step. Specifically, the order pool adds newly arrived orders and removes overdue ones. For IDDQN, the Q-value of each worker-order pair is calculated, and ILP is applied to maximize the global Q-value. For TD3, the probability of each worker-order pair is computed, followed by the application of ILP to maximize the global assignment probability.
  • Figure 2: Network Architecture: The network consists of three main components: the feature extractor, the actor sub-network, and the critic sub-network. First, a worker encoder and an order encoder are used to extract features from individual worker and order information, respectively. Then an Actor BERT model captures the relationships between them and a QK-Attention module calculates the selection probabilities for each worker-order pair. Finally, the fused features of the selected worker-order pairs are input into two separate Critic BERT models for further information extraction, and two Critic MLPs compute the Q-values, as TD3 requires two critics. (In this figure, the fused sequence (input to Critic-BERT) represents workers $1$, $3$, $6$, and $n$ selecting orders $2$, $3$, $4$, and $m$, respectively.)
  • Figure 3: Training Process: Each method is trained five times, and the curve is smoothed using Exponential Moving Average (EMA) with $\alpha = 0.1$. The shaded area represents the standard deviation.
  • Figure 4: Comparison Between Different Noise Methods: Each method is trained three times, and the curve is smoothed using EMA with $\alpha = 0.1$. The shaded area represents the range of fluctuations, while the solid line indicates the average value.
  • Figure 5: Network Structure in Stage 1
  • ...and 2 more figures
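
The dispatch step described in the Figure 1 caption (score every worker-order pair, then pick the one-to-one assignment with the highest global score) can be illustrated with a small Python sketch. The caption says ILP is used. The sketch below swaps in exhaustive search over permutations, which is only feasible for toy sizes, and the score matrix and function name are hypothetical. It also ignores idle workers and order bundling.

```python
from itertools import permutations


def best_assignment(scores):
    """Return (total, pairs) maximizing the summed score of a one-to-one matching.

    scores[w][o] is a hypothetical score (e.g. a Q-value, or a log-probability)
    for worker w taking order o. Assumes there are at least as many orders as
    workers. Exhaustive search stands in for the ILP solver here.
    """
    n_workers = len(scores)
    n_orders = len(scores[0])
    best_total, best_pairs = float("-inf"), []
    # Each permutation assigns a distinct order to every worker.
    for orders in permutations(range(n_orders), n_workers):
        total = sum(scores[w][o] for w, o in enumerate(orders))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(orders))
    return best_total, best_pairs


# Both workers prefer order 0, so a per-worker greedy choice would conflict.
scores = [
    [5.0, 1.0, 0.0],
    [4.0, 3.0, 0.0],
]
total, pairs = best_assignment(scores)
print(total, pairs)  # 8.0 [(0, 0), (1, 1)]
```

The example shows why a global solve matters: each worker's individually best order is order 0, but the best joint assignment gives order 0 to worker 0 and order 1 to worker 1. If the scores are probabilities rather than Q-values, passing their logarithms makes the summed objective correspond to a product of probabilities. That reading of "global assignment probability" is an assumption, not something the caption states.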