
Large-scale traffic signal control using machine learning: some traffic flow considerations

Jorge A. Laval, Hao Zhou

TL;DR

This study examines the performance of supervised learning, random search, and deep reinforcement learning (DRL) for large-scale traffic signal control on a grid network inscribed on a torus, with traffic modeled by Cellular Automaton Rule 184. It shows that supervised learning with only two example states can outperform a greedy baseline, while random search finds near-optimal policies. DRL, in contrast, struggles under congested conditions: its effectiveness deteriorates as training occupancy increases and becomes very poor once that occupancy exceeds about $75\%$. The authors identify a property of congested urban grids that makes throughput largely policy-independent at high densities, which they conjecture explains DRL's poor learning in these regimes, and they advocate discarding congested training data or using hybrid learning strategies. The work highlights the importance of training data selection and suggests combining DRL for free flow with supervised methods for congestion to improve the real-world applicability of learning-based signal control.
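The greedy baseline is longest-queue-first (LQF): at each intersection, the approaches holding the longest queue are served. The sketch below is a minimal illustration, not the authors' implementation. It assumes each approach segment is a list of 0/1 cells (1 = occupied) whose last entry is the cell just upstream of the stop line, measures a queue as the run of occupied cells ending at the stop line, and breaks ties by keeping the current phase. The function names are hypothetical.

```python
from typing import Sequence


def queue_length(segment: Sequence[int]) -> int:
    """Count consecutive occupied cells ending at the stop line.

    Assumption: segment[-1] is adjacent to the stop line; 1 = car, 0 = empty.
    """
    q = 0
    for cell in reversed(segment):
        if cell != 1:
            break
        q += 1
    return q


def lqf_red_for_ns(north, south, east, west, ns_red_now: bool) -> bool:
    """Return True if the North-South approaches should be shown red.

    Hypothetical LQF rule: serve the phase containing the single longest
    queue; on a tie, keep the current phase.
    """
    ns_q = max(queue_length(north), queue_length(south))
    ew_q = max(queue_length(east), queue_length(west))
    if ew_q > ns_q:
        return True       # East-West holds the longest queue: NS red, EW green
    if ns_q > ew_q:
        return False      # North-South holds the longest queue: NS green
    return ns_red_now     # tie: keep the current phase


# Example with n = 5 cells per segment: EW queue of 3 beats NS queue of 1.
print(lqf_red_for_ns([0, 0, 0, 0, 1], [0] * 5, [0, 0, 1, 1, 1], [0] * 5, False))
```

Whether queues are compared per approach or summed per phase is a design choice the summary does not specify; taking the maximum follows the literal "longest queue" reading.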

Abstract

This paper uses supervised learning, random search and deep reinforcement learning (DRL) methods to control large signalized intersection networks. The traffic model is Cellular Automaton rule 184, which has been shown to be a parameter-free representation of traffic flow and is the most efficient implementation of the Kinematic Wave model with a triangular fundamental diagram. We are interested in the steady-state performance of the system, both spatially and temporally: we consider a homogeneous grid network inscribed on a torus, which makes the network boundary-free, and drivers choose random routes. As a benchmark we use the longest-queue-first (LQF) greedy algorithm. We find that: (i) a policy trained with supervised learning on only two examples outperforms LQF, (ii) random search is able to generate near-optimal policies, and (iii) the prevailing average network occupancy during training is the major determinant of the effectiveness of DRL policies. When trained under free-flow conditions, DRL yields policies that are optimal for all traffic conditions, but this performance deteriorates as the occupancy during training increases. For training occupancies above 75%, DRL policies perform very poorly for all traffic conditions, which means that DRL methods cannot learn under highly congested conditions. We conjecture that DRL's inability to learn under congestion might be explained by a property of urban networks found here, whereby even a very bad policy produces an intersection throughput higher than downstream capacity. This means that the actual throughput tends to be independent of the policy. Our findings imply that it is advisable for current DRL methods in the literature to discard any congested data when training, and that doing so will improve their performance under all traffic conditions.
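The statement that Rule 184 implements the Kinematic Wave model with a triangular fundamental diagram can be checked numerically. The sketch below assumes a single ring road (periodic boundary, no intersections), encodes a car as 1 and an empty cell as 0, and measures flow as the fraction of cells whose car advances per time step. In steady state this flow should approach $q = \min(k, 1-k)$ for density $k$, in units where both the free-flow speed and the backward wave speed are one cell per step. Function names are illustrative, not taken from the paper.

```python
import numpy as np


def rule184_local(left, center, right):
    """Rule 184: new c_i from the neighborhood (c_{i-1}, c_i, c_{i+1}).

    A car in cell i-1 enters cell i if it is empty; a car in cell i stays
    only if cell i+1 is occupied (it is blocked).
    """
    return (left & (1 - center)) | (center & right)


def rule184_step(cells: np.ndarray) -> np.ndarray:
    """One synchronous update on a ring (periodic boundary)."""
    left = np.roll(cells, 1)    # left[i] = cells[i-1]
    right = np.roll(cells, -1)  # right[i] = cells[i+1]
    return rule184_local(left, cells, right)


def mean_flow(length: int, density: float, steps: int, seed: int = 0) -> float:
    """Average flow (cars moved per cell per step) over the last `length` steps.

    Requires steps >= length; earlier steps serve as a warm-up transient.
    """
    rng = np.random.default_rng(seed)
    cells = np.zeros(length, dtype=np.int64)
    n_cars = int(round(density * length))
    cells[rng.choice(length, size=n_cars, replace=False)] = 1
    moved = 0
    for t in range(steps):
        if t >= steps - length:
            right = np.roll(cells, -1)
            moved += int(np.sum(cells & (1 - right)))  # cars with an empty cell ahead
        cells = rule184_step(cells)
    return moved / (length * length)


# The eight neighborhoods, read as binary digits, reproduce the rule number.
rule_number = sum(
    int(rule184_local(l, c, r)) << (4 * l + 2 * c + r)
    for l in (0, 1) for c in (0, 1) for r in (0, 1)
)
print("rule number:", rule_number)

for k in (0.2, 0.5, 0.8):
    print(f"k={k}: simulated q={mean_flow(200, k, 600):.3f}, min(k, 1-k)={min(k, 1 - k):.3f}")
```

Near the critical density $k = 0.5$ the transient is slowest, so the simulated flow there may sit slightly below the capacity value of $0.5$ for short runs.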

Paper Structure

This paper contains 17 sections, 9 equations, 10 figures, and 1 algorithm.

Figures (10)

  • Figure 1: CA Rule 184: in each of the eight cases, the top row shows the neighborhood values $( c_{i-1} , c_i , c_{i+1})$ and the bottom row shows the updated $c_i$.
  • Figure 2: Example $3\times 4$ traffic network. The connecting links that form the torus are shown as dashed directed links; we have omitted the cells on these links to avoid clutter. Each segment has $n=5$ cells; an additional cell has been added downstream of each segment to indicate the traffic light color.
  • Figure 3: Neural network architecture to approximate the policy. The numbers on top of the arrows indicate the dimensions of the corresponding input/output vectors, and the numbers below the squares are as follows: the input is the state observable by the agent, 1: linear layer, 2: tanh function, 3: linear layer, 4: summation layer, 5: sigmoid function, and the output is a single real number that gives the probability of turning the light red for the North-South approaches.
  • Figure 4: Random policies. Each diagram is a different trial, and shows the average density versus average flow in the network. The dashed line corresponds to the benchmark LQF policy. The red and green envelope curves show the macroscopic fundamental diagram (MFD) bounds.
  • Figure 5: Supervised learning experiment. Left: extreme state $s_1$, where both North-South approaches are empty and the East-West ones are at jam density; we have omitted the cells on links other than the ones observable by the middle intersection to avoid clutter. Middle: extreme state $s_2$, the opposite of $s_1$. Right: resulting MFD (shaded area).
  • ...and 5 more figures
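Figures 3 and 5 together suggest how a policy can be fitted from just two examples. The sketch below mirrors the layer sequence listed in the Figure 3 caption (linear, tanh, linear, summation, sigmoid) and trains it on the two extreme states of Figure 5. Everything beyond that layer list is an assumption made for illustration, since the captions do not specify it: the input layout (four approaches of $n=5$ cells each, ordered N, S, E, W), the hidden widths, the labels (turn North-South red in $s_1$, keep it green in $s_2$), the binary cross-entropy loss, and full-batch gradient descent. `PolicyNet` and its methods are hypothetical names.

```python
import numpy as np

N_CELLS = 5                 # cells per segment, as in Figure 2
N_IN = 4 * N_CELLS          # assumed observable state: N, S, E, W approaches
N_HIDDEN, N_OUT = 8, 4      # hypothetical layer widths


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class PolicyNet:
    """Linear -> tanh -> linear -> sum -> sigmoid, following Figure 3's layer list."""

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (N_HIDDEN, N_IN))
        self.b1 = np.zeros(N_HIDDEN)
        self.W2 = rng.normal(0.0, 0.1, (N_OUT, N_HIDDEN))
        self.b2 = np.zeros(N_OUT)

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)   # layers 1-2: linear + tanh
        z = self.W2 @ h + self.b2            # layer 3: linear
        s = z.sum()                          # layer 4: summation to a scalar
        return sigmoid(s), h                 # layer 5: P(NS light turns red)

    def prob_red_ns(self, x) -> float:
        return float(self.forward(x)[0])

    def fit(self, X, y, lr=0.2, epochs=2000):
        """Full-batch gradient descent on binary cross-entropy (assumed loss)."""
        params = (self.W1, self.b1, self.W2, self.b2)
        for _ in range(epochs):
            grads = [np.zeros_like(p) for p in params]
            for x, target in zip(X, y):
                p, h = self.forward(x)
                ds = p - target                        # dBCE/ds for a sigmoid output
                dz = np.full(N_OUT, ds)                # summation passes ds to each unit
                da = (self.W2.T @ dz) * (1.0 - h**2)   # back through tanh
                for g, d in zip(grads, (np.outer(da, x), da, np.outer(dz, h), dz)):
                    g += d / len(X)                    # in-place accumulation
            for param, g in zip(params, grads):
                param -= lr * g                        # in-place parameter update


empty, jam = np.zeros(N_CELLS), np.ones(N_CELLS)
s1 = np.concatenate([empty, empty, jam, jam])   # N, S empty; E, W at jam density
s2 = np.concatenate([jam, jam, empty, empty])   # the opposite of s1
net = PolicyNet()
net.fit([s1, s2], [1.0, 0.0])                   # assumed labels
print(f"P(red NS | s1) = {net.prob_red_ns(s1):.3f}, P(red NS | s2) = {net.prob_red_ns(s2):.3f}")
```

A linear layer followed by a summation is equivalent to a single linear layer with a scalar output; both are kept here only to mirror the caption's numbering.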