Single- vs. Dual-Policy Reinforcement Learning for Dynamic Bike Rebalancing

Jiaqi Liang; Defeng Liu; Sanjay Dominik Jena; Andrea Lodi; Thibaut Vidal

TL;DR

The paper tackles dynamic bike rebalancing in bike-sharing systems under stochastic demand using two reinforcement learning architectures. It formulates the problem as a continuous-time Markov Decision Process (MDP) and introduces Single-policy RL (SPRL), in which one deep Q-network (DQN) jointly learns inventory and routing decisions, and Dual-policy RL (DPRL), which decouples inventory and routing into two specialized DQNs with shared state. Evaluated with an event-driven simulator on synthetic GT1/GT2 datasets, DPRL consistently outperforms SPRL and MIP baselines, achieving sizable reductions in lost demand (e.g., GT1: 23.1, GT2: 10.0) and supporting real-time use once offline training is complete. The work highlights the value of decoupled decision-making for dynamic, asynchronous fleet operations and points to future extensions to larger networks and additional mobility features such as charging and pricing strategies.

Abstract

Bike-sharing systems (BSS) provide a sustainable urban mobility solution, but ensuring their reliability requires effective rebalancing strategies to address stochastic demand and prevent station imbalances. This paper proposes reinforcement learning (RL) algorithms for the dynamic rebalancing problem with multiple vehicles, introducing and comparing two RL approaches: Single-policy RL and Dual-policy RL. We formulate this network optimization problem as a Markov Decision Process within a continuous-time framework, allowing vehicles to make independent and cooperative rebalancing decisions without synchronization constraints. In the first approach, a single deep Q-network (DQN) is trained to jointly learn inventory and routing decisions. The second approach decouples node-level inventory decisions from arc-level vehicle routing, enhancing learning efficiency and adaptability. A high-fidelity simulator under the first-arrive-first-serve rule is developed to estimate rewards across diverse demand scenarios influenced by temporal and weather variations. Extensive experiments demonstrate that while the single-policy model is competitive against several benchmarks, the dual-policy model significantly reduces lost demand. These findings provide valuable insights for bike-sharing operators, reinforcing the potential of RL for real-time rebalancing and paving the way for more adaptive and intelligent urban mobility solutions.
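
To make the dual-policy idea concrete, the following is a minimal Python sketch of a single decision step in which a node-level inventory choice and an arc-level routing choice are read from two separate Q-functions over the same state. It is an illustration under stated assumptions, not the authors' implementation: the state layout, the action set of -2 to +2 bikes, the order of the two decisions, and the hand-written functions `toy_inventory_q` and `toy_routing_q` (stand-ins for trained DQNs) are all hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action from a dict {action: Q-value}, exploring with prob. epsilon."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)


def toy_inventory_q(state):
    """Hypothetical stand-in for an inventory DQN.

    Actions are bike deltas at the vehicle's station (positive = drop off,
    negative = pick up). The score favors a half-full station.
    """
    s = state["vehicle_station"]
    bikes, cap = state["bikes"][s], state["capacity"][s]
    q = {}
    for delta in range(-2, 3):
        fits_station = 0 <= bikes + delta <= cap
        fits_vehicle = delta <= state["vehicle_load"]  # cannot drop more than carried
        if fits_station and fits_vehicle:
            q[delta] = -abs(bikes + delta - cap // 2)
    return q


def toy_routing_q(state):
    """Hypothetical stand-in for a routing DQN: prefer the most imbalanced station."""
    here = state["vehicle_station"]
    return {
        j: abs(b - state["capacity"][j] // 2)
        for j, b in enumerate(state["bikes"])
        if j != here
    }


def dual_policy_step(state, inventory_q, routing_q, epsilon=0.0, rng=None):
    """One decision for one vehicle: inventory action, then next station.

    Both Q-functions see the same state; the decision order is an assumption.
    """
    rng = rng or random.Random(0)
    delta = epsilon_greedy(inventory_q(state), epsilon, rng)
    next_station = epsilon_greedy(routing_q(state), epsilon, rng)
    return delta, next_station


state = {
    "vehicle_station": 0,
    "vehicle_load": 2,
    "bikes": [1, 9, 5],
    "capacity": [10, 10, 10],
}
decision = dual_policy_step(state, toy_inventory_q, toy_routing_q)
```

In this toy state the vehicle sits at a nearly empty station, so the greedy inventory choice drops both carried bikes and the greedy routing choice heads for the overfull station. A single-policy variant would instead score each (delta, next station) pair with one Q-function over the joint action space.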

Paper Structure (25 sections, 9 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Models based on Mixed Integer Programming
Markov Decision Processes
Reinforcement Learning
Problem Definition and General MDP Framework
Network Model
MDP Framework for SPRL and DPRL
State Space
Action Space
Network Loss (Reward) Function
Transition Function and Event-Driven Simulator
Continuous Time Framework
RL Policies for DBRP
Q-learning and DQN
...and 10 more sections

Figures (11)

Figure 1: Dynamic rebalancing in BSS and DPRL: Inventory information, vehicle location and inventory level, and user demand typically serve as inputs in dynamic rebalancing models. The proposed DPRL generates rebalancing solutions based on the environmental interactions among stations, vehicles, and users.
Figure 2: Continuous time planning framework of SPRL
Figure 3: Continuous time planning framework of DPRL
Figure 4: SPRL Pipeline for Real-time Rebalancing
Figure 5: DPRL Pipeline for Real-time Rebalancing
...and 6 more figures