MS-Net: A Multi-Path Sparse Model for Motion Prediction in Multi-Scenes

Xiaqiang Tang, Weigao Sun, Siyuan Hu, Yiyang Sun, Yafeng Guo

TL;DR

Multi-Scenes Network (MS-Net) is a multi-path sparse model trained by an evolutionary process. It outperforms existing state-of-the-art methods on well-established pedestrian motion prediction datasets (e.g., ETH and UCY) and ranks 2nd on the INTERACTION challenge.

Abstract

The multi-modal and stochastic nature of human behavior makes motion prediction, which is critical for autonomous driving, a highly challenging task. While deep learning approaches have demonstrated great potential in this area, how to connect multiple driving scenes (e.g., merging, roundabout, intersection) with the design of deep learning models remains an open problem. Current learning-based methods typically use one unified model to predict trajectories in different scenarios, which may yield sub-optimal results for any individual scene. To address this issue, we propose Multi-Scenes Network (MS-Net), a multi-path sparse model trained by an evolutionary process. During inference, MS-Net selectively activates a subset of its parameters to produce predictions for each scene. During training, motion prediction across different scenes is formulated as a multi-task learning problem, and an evolutionary algorithm is designed to encourage the network to search for the optimal parameters for each scene while sharing common knowledge between scenes. Our experimental results show that, with substantially fewer parameters, MS-Net outperforms existing state-of-the-art methods on well-established pedestrian motion prediction datasets (e.g., ETH and UCY) and ranks 2nd on the INTERACTION challenge.
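
The inference-time idea of activating only a scene-specific subset of shared parameters can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `KnowledgePool` class, the layer names, and the toy "layer" (a scale and a bias applied to a scalar) are all hypothetical stand-ins for real network modules.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgePool:
    """Hypothetical shared store of layers; a layer is a [scale, bias] pair here."""
    layers: dict = field(default_factory=dict)  # layer id -> list of weights
    paths: dict = field(default_factory=dict)   # scene name -> ordered layer ids

    def add_layer(self, layer_id, weights):
        self.layers[layer_id] = list(weights)

    def register_scene(self, scene, layer_ids):
        missing = [lid for lid in layer_ids if lid not in self.layers]
        if missing:
            raise KeyError(f"unknown layers: {missing}")
        self.paths[scene] = list(layer_ids)

    def active_fraction(self, scene):
        # Share of all pooled weights that this scene's path actually uses.
        total = sum(len(w) for w in self.layers.values())
        active = sum(len(self.layers[lid]) for lid in self.paths[scene])
        return active / total

    def predict(self, scene, x):
        # Toy forward pass: only layers on the scene's path are applied.
        for lid in self.paths[scene]:
            scale, bias = self.layers[lid]
            x = scale * x + bias
        return x


pool = KnowledgePool()
pool.add_layer("embed", [2.0, 0.0])               # shared by both scenes
pool.add_layer("encoder_shared", [1.0, 1.0])      # shared by both scenes
pool.add_layer("decoder_merging", [0.5, 0.0])     # scene-specific
pool.add_layer("decoder_roundabout", [3.0, -1.0])  # scene-specific
pool.register_scene("merging", ["embed", "encoder_shared", "decoder_merging"])
pool.register_scene("roundabout", ["embed", "encoder_shared", "decoder_roundabout"])

merging_out = pool.predict("merging", 1.0)
roundabout_out = pool.predict("roundabout", 1.0)
merging_share = pool.active_fraction("merging")
```

In this toy setup the two scenes share the embedding and encoder but diverge at the decoder, so each prediction touches only part of the pooled weights.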

Paper Structure

This paper contains 14 sections, 7 equations, 5 figures, 5 tables, and 2 algorithms.

Figures (5)

  • Figure 1: A comparison between the current unified motion prediction model (top) and our MS-Net (bottom). Note that the model structure in the figure is illustrative and doesn't reflect our actual experimental setup.
  • Figure 2: The overall training and inference processes of MS-Net. We choose a meta-model to initialize the Knowledge Pool. For new scenarios, a template model is selected from the pool, and sub-models are generated using an evolutionary algorithm. These sub-models are trained for scenario-specific knowledge. An evaluation function balances accuracy with parameter count. Since most parameters are inherited, we only need to add additional parameters (i.e., new knowledge) to the Knowledge Pool. In the inference stage, we form a sparse model from the Knowledge Pool, activating only a small part of all parameters for each scenario to achieve a scenario-distinct motion prediction model.
  • Figure 3: The MS-Net structure diagram obtained by training on ETH/UCY. We use seven training sets from ETH/UCY as seven separate scenarios and take the model from [predictionTransfomer] as the meta-model. Modules with the same color indicate network layers obtained from the same scenario; layers such as the encoder and embedding layers are commonly reused across tasks. In more complex scenarios, such as the student003 dataset, the network adaptively adds "Encoder Layer 1" to better handle them.
  • Figure 4: Knowledge Transfer and Model Evolution in MS-Net training. On the left, network parameters from the meta-model are inherited through a knowledge transfer process: solid layers are trainable in the sub-model, while hatched layers are non-trainable and reused. On the right, in the Model Evolution approach, the sub-model adds a trainable layer (e.g., decoder) while reusing other frozen layers from the meta-model.
  • Figure 5: Comparison of AutoBot [autobot] with MS-Net on the INTERACTION validation set. Past trajectories are shown in yellow, ground-truth trajectories in red, and predicted trajectories in green.
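
The training loop described in the captions for Figures 2 and 4 (pick a template, generate sub-models that reuse frozen layers and add trainable ones, then score them by accuracy and parameter count) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the additive score `error + penalty * new_params`, the exhaustive one-layer-at-a-time candidate generation, and the lookup-table `toy_error` function that stands in for actually training and validating a sub-model are all hypothetical.

```python
def candidates(template):
    """Yield sub-models as tuples of (layer_name, trainable) pairs.

    The first candidate reuses every template layer frozen; each later one
    makes exactly one layer a new trainable copy and keeps the rest frozen.
    """
    frozen = [(name, False) for name in template]
    yield tuple(frozen)
    for i, name in enumerate(template):
        cand = list(frozen)
        cand[i] = (name, True)
        yield tuple(cand)


def evolve(template, layer_sizes, evaluate_error, penalty=0.001):
    """Return (score, candidate, new_params) with the lowest score.

    Assumed score: validation error plus a penalty per newly added parameter.
    """
    best = None
    for cand in candidates(template):
        new_params = sum(layer_sizes[name] for name, trainable in cand if trainable)
        score = evaluate_error(cand) + penalty * new_params
        if best is None or score < best[0]:
            best = (score, cand, new_params)
    return best


# Hypothetical layer sizes and errors; real values would come from training.
layer_sizes = {"embed": 100, "encoder": 1000, "decoder": 200}
error_if_retrained = {None: 1.0, "embed": 0.95, "encoder": 0.5, "decoder": 0.6}


def toy_error(cand):
    # Stand-in for training the sub-model and measuring its validation error.
    retrained = [name for name, trainable in cand if trainable]
    return error_if_retrained[retrained[0] if retrained else None]


best_score, best_cand, added = evolve(
    ["embed", "encoder", "decoder"], layer_sizes, toy_error
)
```

With these made-up numbers, retraining the encoder gives the lowest raw error, but the parameter penalty makes the new decoder the preferred sub-model, so only the decoder's 200 parameters would be added to the pool.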