Massively Scalable Inverse Reinforcement Learning in Google Maps

Matt Barnes, Matthew Abueg, Oliver F. Lange, Matt Deeds, Jason Trader, Denali Molitor, Markus Wulfmeier, Shawn O'Banion

TL;DR

The paper introduces scaling techniques for inverse reinforcement learning based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms.

Abstract

Inverse reinforcement learning (IRL) offers a powerful and general framework for learning humans' latent preferences in route recommendation, yet no approach has successfully addressed planetary-scale problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic IRL methods in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.

Paper Structure (43 sections, 3 theorems, 19 equations, 11 figures, 3 tables, 4 algorithms)

Key Result

Theorem B.1

$\ell(\theta)<\infty$ iff $A$ has a dominant eigenvalue of 1.
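The eigenvalue condition in Theorem B.1 can be checked numerically on small examples. The sketch below estimates a matrix's dominant eigenvalue with plain power iteration. This excerpt does not define $A$ or $\ell(\theta)$, so the matrix `TOY_A` and the helper names `matvec` and `dominant_eigenvalue` are hypothetical illustrations, not the paper's construction or solver.

```python
import math


def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dominant_eigenvalue(A, iters=1000, tol=1e-12):
    """Estimate the largest-magnitude eigenvalue of A by power iteration.

    Assumes a single dominant eigenvalue and a start vector that is not
    orthogonal to its eigenvector; otherwise the iteration may not converge.
    """
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # A maps the iterate to zero (e.g. a nilpotent matrix).
            return 0.0
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized iterate.
        new_lam = sum(x * y for x, y in zip(v, matvec(A, v)))
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Hypothetical 3-state chain whose last state is absorbing (self-loop weight 1).
TOY_A = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

lam = dominant_eigenvalue(TOY_A)
print("dominant eigenvalue is 1:", abs(lam - 1.0) < 1e-9)
```

In this toy matrix the absorbing self-loop is what puts an eigenvalue of exactly 1 on the diagonal. Scaling every entry by a constant scales the dominant eigenvalue by the same constant.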

Figures (11)

  • Figure 1: Google Maps route accuracy improvements in several world regions, when using our inverse reinforcement learning policy. Full results are presented in \ref{tab:results} and \ref{fig:nodes_vs_accuracy}.
  • Figure 2: Architecture overview. The final rewards are used to serve online routing requests.
  • Figure 3: RHIP (Receding Horizon Inverse Planning)
  • Figure 4: Example of the 360M parameter sparse model finding and correcting a data quality error in Nottingham. The preferred route is incorrectly marked as private property due to the presence of a gate (which is never closed), and incorrectly incurs a high cost. The detour route is long and narrow. The sparse model learns to correct the data error with a large positive reward on the gated segment. Additional examples are provided in \ref{app:experiments}.
  • Figure 6: Sparse mixture-of-experts learn preferences specific to their geographic region, as demonstrated by the drop in off-diagonal performance.
  • ...and 6 more figures

Theorems & Definitions (6)

  • Theorem B.1
  • proof
  • Theorem B.2
  • proof
  • Theorem B.3
  • proof