Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving

Shota Yamazaki, Chenyu Zhang, Takuya Nanri, Akio Shigekane, Siyuan Wang, Jo Nishiyama, Tao Chu, Kohei Yokosawa

TL;DR

A reasoning model is proposed that takes the future planning trajectories of the ego vehicle as inputs, addressing the limited interpretability of the decision-making process from perception to control of the ego vehicle.

Abstract

End-to-end autonomous driving models have been developed recently. These models lack interpretability in their decision-making process from perception to control of the ego vehicle, which can make passengers anxious. One effective remedy is to build a model that outputs captions describing the future behaviors of the ego vehicle and the reasons for them. However, existing approaches generate reasoning text that inadequately reflects the future plans of the ego vehicle, because they train models to output captions using momentary control signals as inputs. In this study, we propose a reasoning model that takes the future planning trajectories of the ego vehicle as inputs to address this limitation, using a newly collected dataset.

Paper Structure

This paper contains 18 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Pipeline of our proposed method. To improve the accountability of an ego-vehicle action, trajectory planning information is embedded as a trajectory image and combined with a camera image in an Image-Trajectory Encoder.
  • Figure 2: Architectures of Image-Trajectory Encoders. (a) Concatenated. The front camera image features and the trajectory features are simply concatenated. (b) Overlaid. The trajectory image is overlaid on the camera image. (c) Cross-attention. With the camera image features as queries $Q$ and the trajectory features as keys $K$ and values $V$, cross-attention layers extract fused features.
  • Figure 3: Examples of our dedicated dataset
  • Figure 4: Example of generated results
  • Figure 5: Limitation example of generated results
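
The three Image-Trajectory Encoder variants named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the token-axis concatenation, the alpha-blended overlay, and the single-head attention without learned projections are all assumptions made for illustration. Only the roles of the inputs (camera features as $Q$, trajectory features as $K$ and $V$) come from the caption.

```python
import numpy as np

def fuse_concat(img_feats, traj_feats):
    # (a) Concatenated: join the two feature sequences.
    # Concatenating along the token axis is an assumption.
    return np.concatenate([img_feats, traj_feats], axis=0)

def fuse_overlay(camera_img, traj_img, alpha=0.5):
    # (b) Overlaid: blend the trajectory image into the camera image
    # wherever the trajectory image is non-zero (alpha is hypothetical).
    mask = traj_img.sum(axis=-1, keepdims=True) > 0
    blended = (1.0 - alpha) * camera_img + alpha * traj_img
    return np.where(mask, blended, camera_img)

def fuse_cross_attention(img_feats, traj_feats):
    # (c) Cross-attention: camera features act as queries Q,
    # trajectory features act as keys K and values V.
    # Learned projection matrices are omitted for brevity.
    d = img_feats.shape[-1]
    scores = img_feats @ traj_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ traj_feats

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(6, 8))   # 6 camera tokens, 8 channels (made-up sizes)
traj_feats = rng.normal(size=(4, 8))  # 4 trajectory tokens, 8 channels
print(fuse_concat(img_feats, traj_feats).shape)
print(fuse_cross_attention(img_feats, traj_feats).shape)
```

In this sketch the cross-attention output keeps one fused vector per camera token, whereas concatenation lengthens the sequence; which layout the downstream language model expects is not stated in the caption.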