Trajectory Modeling via Random Utility Inverse Reinforcement Learning

Anselmo R. Pitombeira-Neto, Helano P. Santos, Ticiana L. Coelho da Silva, José Antonio F. de Macedo

TL;DR

The paper addresses modeling driver trajectories observed by sparse road-network sensors by introducing random utility inverse reinforcement learning (RU-IRL), which assumes deterministic optimal behavior under an unknown reward with unobserved utility $\epsilon$. It derives a Markov decision process with extended state $(s,\epsilon)$, proves existence of the value function $v_{\bm{\theta}}$ via contraction (for $0<\gamma<1$) and shows ME-IRL is a special case at $\gamma=1$, enabling exact normalization without enumerating trajectories. Parameters are estimated with Bayesian inference using Metropolis-Hastings, exploiting a scale invariance that makes only ratios $\beta_k/\alpha$ identifiable. A case study on Fortaleza data demonstrates RU-IRL can recover the relative importance of distance versus travel time and yield competitive next-location predictions with far fewer parameters than a Markov model. The approach offers a transparent, statistically principled alternative to black-box trajectory models and can be extended to capture user heterogeneity and sensor noise.

Abstract

We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. Cars are detected by sensors placed at sparsely distributed points on the street network of a city. As rational agents, drivers are trying to maximize some reward function unknown to an external observer. We apply the concept of random utility from econometrics to model the unknown reward function as a function of observed and unobserved features. In contrast to current inverse reinforcement learning approaches, we do not assume that agents act according to a stochastic policy; rather, we assume that agents act according to a deterministic optimal policy and show that randomness in data arises because the exact rewards are not fully observed by an external observer. We introduce the concept of extended state to cope with unobserved features and develop a Markov decision process formulation of drivers' decisions. We present theoretical results which guarantee the existence of solutions and show that maximum entropy inverse reinforcement learning is a particular case of our approach. Finally, we illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city in Brazil.
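
The contraction argument and the scale invariance mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that do not come from this page: the unobserved utilities are taken to be i.i.d. zero-mean Gumbel with scale `alpha`, transitions are deterministic, and the observed reward is a negative linear cost `-(features @ beta)`. Under those assumptions the expected value function satisfies a log-sum-exp fixed-point equation, which is a contraction for $0<\gamma<1$. The function name `soft_value_iteration` and the three-state toy network are hypothetical.

```python
import numpy as np

def soft_value_iteration(features, next_state, beta, alpha, gamma=0.9,
                         tol=1e-10, max_iter=10_000):
    """Fixed-point iteration for the expected value function.

    features:   (S, A, K) array of cost features (e.g. distance, time).
    next_state: (S, A) integer array, deterministic successor of each action.
    Assumes i.i.d. zero-mean Gumbel noise with scale alpha (an assumption).
    Returns the value vector v (S,) and choice probabilities p (S, A).
    """
    reward = -(features @ beta)                  # (S, A) observed reward
    v = np.zeros(reward.shape[0])
    for _ in range(max_iter):
        z = (reward + gamma * v[next_state]) / alpha
        m = z.max(axis=1, keepdims=True)         # stabilise the log-sum-exp
        v_new = alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Logit choice probabilities implied by the Gumbel assumption.
    z = (reward + gamma * v[next_state]) / alpha
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)
    return v, p

# Hypothetical toy network: state 2 is an absorbing destination.
features = np.array([
    [[1.0, 2.0], [3.0, 2.5]],   # state 0: go to 1, or go directly to 2
    [[1.5, 1.0], [1.0, 2.0]],   # state 1: go to 2, or go back to 0
    [[0.0, 0.0], [0.0, 0.0]],   # state 2: stay (zero cost)
])
next_state = np.array([[1, 2], [2, 0], [2, 2]])
beta = np.array([0.5, 1.0])
alpha = 1.0

v, p = soft_value_iteration(features, next_state, beta, alpha)

# Scale invariance: multiplying beta and alpha by the same constant
# rescales v but leaves the choice probabilities unchanged.
v_scaled, p_scaled = soft_value_iteration(features, next_state,
                                          3.0 * beta, 3.0 * alpha)
print(np.allclose(p, p_scaled), np.allclose(v_scaled, 3.0 * v))
```

Because the choice probabilities are identical for `(beta, alpha)` and `(3 * beta, 3 * alpha)`, observed trajectories alone cannot separate the two settings, which is consistent with the statement that only the ratios $\beta_k/\alpha$ are identifiable.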

Paper Structure

This paper contains 13 sections, 6 theorems, 57 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

For a real-valued function $f(s)$ defined on $\mathcal{S}$, let $g(s) = f(s)+c, \forall s \in \mathcal{S}$ and $c \in \mathbb{R}$. Then

Figures (5)

  • Figure 1: Set of 272 external sensors located on the street network in the city of Fortaleza, Brazil.
  • Figure 2: A sample trajectory. Initial and final dots indicate the origin and destination of a vehicle, respectively. Intermediary dots along the trajectory represent external sensors which detected the vehicle. The distance of two consecutive sensors which detected the vehicle may range from a few meters to a few kilometers.
  • Figure 3: Markov chain generated by Algorithm \ref{alg:MH}. Starting values were 0.01 and 15.0 for $\beta_1$ and $\beta_2$, respectively, obtained by running a Bayesian optimization algorithm for some iterations. For better visualization, only the first $5\times 10^3$ samples are shown.
  • Figure 4: Histograms of $5\times 10^3$ samples in the left tail of the Markov chain generated by Algorithm \ref{alg:MH}. Posterior means for $\beta_1$ and $\beta_2$ are $7.947 \times 10^{-5}$ and 13.67, respectively. The smooth curves over the histograms were obtained by kernel density estimation.
  • Figure 5: Accuracy of each of the alternative methods compared in the online next location prediction task. $\text{Acc}$ is the number of locations correctly predicted over the total number of locations observed in all trajectories in the holdout sample, while $\text{Acc}_{< 0.5}$ ($\text{Acc}_{< 1.0}$) counts an incorrectly predicted location within 0.5 km (1.0 km) of the correct location as a success. The value 100 denotes perfect prediction, i.e., 100% of all locations correctly predicted.
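
The accuracy measures described in the Figure 5 caption can be sketched as follows. This is an illustration only: the helper names, the sensor identifiers and coordinates, and the use of great-circle (haversine) distance are assumptions, since this page does not state how the distance between the predicted and the correct location is measured.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def accuracy(predicted, observed, coords, tol_km=0.0):
    """Percentage of locations predicted correctly.

    A wrong prediction still counts as a success when it lies closer than
    tol_km to the observed location; tol_km=0.0 gives the plain Acc.
    """
    hits = 0
    for pred, true in zip(predicted, observed):
        if pred == true or haversine_km(coords[pred], coords[true]) < tol_km:
            hits += 1
    return 100.0 * hits / len(observed)

# Hypothetical sensors: B is about 0.33 km from A, C is about 5.6 km from A.
coords = {"A": (-3.730, -38.520), "B": (-3.733, -38.520), "C": (-3.780, -38.520)}
observed = ["A", "A", "A", "A"]
predicted = ["A", "B", "C", "A"]

acc = accuracy(predicted, observed, coords)                  # exact matches only
acc_05 = accuracy(predicted, observed, coords, tol_km=0.5)   # Acc_{<0.5}
acc_10 = accuracy(predicted, observed, coords, tol_km=1.0)   # Acc_{<1.0}
print(acc, acc_05, acc_10)
```

In this toy holdout the near miss at sensor B is forgiven by both tolerance-based measures, while the distant miss at sensor C counts as an error under all three.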

Theorems & Definitions (12)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof
  • ...and 2 more