MRIC: Model-Based Reinforcement-Imitation Learning with Mixture-of-Codebooks for Autonomous Driving Simulation

Baotian He; Yibing Li

TL;DR

MRIC, a model-based reinforcement-imitation learning framework with a temporally abstracted mixture-of-codebooks, outperforms state-of-the-art baselines on the Waymo Open Motion Dataset in diversity, behavioral realism, and distributional realism, with large margins on key metrics such as collision rate, minSADE, and time-to-collision JSD.

Abstract

Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of the behavior distribution, the high dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state-matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, this approach suffers from gradient explosion and weak supervision in low-density regions. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These insights lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces an open-loop model-based imitation learning regularization to stabilize training, and a model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate the interference from the regularizations while ensuring their effectiveness. Experimental results using the large-scale Waymo Open Motion Dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).

Paper Structure (64 sections, 3 theorems, 51 equations, 14 figures, 12 tables, 1 algorithm)

Introduction
Related works
Open-Loop Behavior Learning
Closed-Loop Behavior Learning
Diverse Behavior Learning
Preliminary
Problem Formulation
Core Challenges
Methodology
Overall Architecture
Differentiable Simulator
Components
Gradient flow
Temporally Abstracted Mixture-of-Codebooks
Hierarchy
...and 49 more sections

Key Result

Proposition 1

Assume that the observation model and the policy take the current state and the current observation as inputs, respectively, namely $p(\bm{o}_{t}|s_{t})$ and $\bm{\pi}_{\theta}(\bm{a}_{t}|\bm{o}_{t})$, and that the objective function is defined over the state sequence: $J = J(s_{1:T})$. After unfolding the policy [...], where we define $\prod_{k=t-1}^{t} \frac{\partial s_{k+1}}{\partial s_{k}} = I$.
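
The setting of Proposition 1 (an objective over the state sequence, differentiated through the rollout of the policy) can be illustrated with a toy sketch. It is not the paper's simulator or derivation. The scalar dynamics `s + dt * a`, the linear policy `a = theta * s` with the observation equal to the state, the squared-error objective, and all function names are assumptions made for illustration. The sketch accumulates the total derivative of each state with respect to the policy parameter along a closed-loop rollout and compares it with a finite-difference estimate.

```python
def rollout(theta, s1, horizon, dt=0.1):
    """Closed-loop rollout: the policy acts on the states it produced itself."""
    states = [s1]
    for _ in range(horizon - 1):
        s = states[-1]
        a = theta * s              # toy policy a_t = theta * o_t, assuming o_t = s_t
        states.append(s + dt * a)  # toy differentiable dynamics (assumption)
    return states


def objective(states, targets):
    """State-matching objective J(s_{1:T}) = sum_t (s_t - target_t)^2."""
    return sum((s - g) ** 2 for s, g in zip(states, targets))


def grad_through_rollout(theta, s1, targets, dt=0.1):
    """dJ/dtheta via the chain rule, carried forward along the rollout."""
    s, ds_dtheta = s1, 0.0         # the initial state does not depend on theta
    grad = 2.0 * (s - targets[0]) * ds_dtheta
    for g in targets[1:]:
        # s_{t+1} = (1 + dt * theta) * s_t, so its total derivative combines
        # the path through the previous state and the direct path through theta.
        ds_dtheta = (1.0 + dt * theta) * ds_dtheta + dt * s
        s = s + dt * theta * s
        grad += 2.0 * (s - g) * ds_dtheta
    return grad


def finite_difference(theta, s1, targets, dt=0.1, eps=1e-6):
    """Central-difference estimate of dJ/dtheta, used only as a check."""
    hi = objective(rollout(theta + eps, s1, len(targets), dt), targets)
    lo = objective(rollout(theta - eps, s1, len(targets), dt), targets)
    return (hi - lo) / (2.0 * eps)


targets = [1.0, 1.2, 1.5, 1.9, 2.4]  # made-up "logged" states, not real data
analytic = grad_through_rollout(0.5, 1.0, targets)
numeric = finite_difference(0.5, 1.0, targets)
```

In this toy case the factor `(1.0 + dt * theta)` plays the role of $\frac{\partial s_{k+1}}{\partial s_{k}}$: products of such factors carry the learning signal from later states back to earlier decisions, and they can also grow quickly over long horizons, which is consistent with the gradient explosion issue mentioned in the abstract.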

Figures (14)

Figure 1: Comparison of generated behaviors and state-visitation distributions between the pure IL variant and the proposed MRIC framework. (a, b): Lime green agents are simulated by the policy. Dynamic visualizations are provided in the captions. (c, d): The y-axis represents frequency. State-matching via closed-loop differentiable simulation provides meaningful learning signals and achieves efficient credit assignment (Minsky, 1961). However, the policy within this paradigm is scarcely constrained in low-density trajectory regions. MRIC introduces the model-based RL regularization to provide domain knowledge in subregions not covered by the data distribution, thereby effectively improving behavioral and distributional realism.
Figure 2: Overall framework. The proposed MRIC uses closed-loop model-based imitation learning (IL) as its primary objective. This is supplemented by open-loop model-based IL and model-based reinforcement learning (RL) for regularization. Together, these form a constrained policy optimization problem. The optimal multipliers are obtained through a dynamic multiplier mechanism that solves the multiplier equation. Lastly, the Lagrangian function acts as the dynamic objective for updating the policy at each step.
Figure 3: The proposed temporally abstracted codebook mixture module. (a) Hierarchical policy with mixture-of-codebooks. (b) Temporal abstraction mechanism.
Figure 4: Probabilistic graphical model representation of a driving scenario. The shaded nodes represent observable variables, while the blank nodes represent unobservable variables. Potential links spanning multiple time steps are omitted for clarity. (a) Generation process. (b) Inference process.
Figure 5: Comparison of gradient flows. (a) Behavior cloning. (b) Open-loop model-based IL via state-matching. (c) Closed-loop model-based IL via state-matching.
...and 9 more figures

Theorems & Definitions (6)

Proposition 1
Remark 1
Proposition 2
Remark 2
Proposition 3
Remark 3
