Learning Priors of Human Motion With Vision Transformers

Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli, Daniele Fontanelli

TL;DR

The paper tackles predicting occupancy priors of humans in semantic maps to enable safe robot navigation. It introduces semapp2, a Vision Transformer–based autoencoder (with a MAE variant) that processes semantic maps to learn occupancy distributions, aiming for real-time inference. On the Stanford Drone Dataset, semapp2 outperforms CNN baselines on metrics such as KL divergence and Earth Mover's Distance (EMD), and demonstrates robust velocity/stop priors, with MAE-semapp2 showing strong generalization in qualitative assessments. The approach offers practical benefits for autonomous navigation and cobot integration, with potential extensions to additional agents and production environments.
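To make the KL divergence metric mentioned above concrete, here is a minimal sketch of how it could be computed between a ground-truth occupancy map and a predicted one. This is an illustration, not the authors' evaluation code: the function names `normalize` and `kl_divergence`, the direction of the divergence (ground truth relative to prediction), and the `eps` clipping of the prediction are all assumptions.

```python
import numpy as np


def normalize(grid):
    """Scale a non-negative occupancy grid so its cells sum to 1."""
    grid = np.asarray(grid, dtype=np.float64)
    total = grid.sum()
    if total <= 0:
        raise ValueError("occupancy grid must have positive total mass")
    return grid / total


def kl_divergence(p_true, q_pred, eps=1e-12):
    """KL(P || Q) between two occupancy grids of the same shape.

    Assumption: P is the ground truth and Q the prediction. Q is clipped
    to `eps` from below so that empty predicted cells do not produce an
    infinite value; cells where P is zero contribute nothing to the sum.
    """
    p = normalize(p_true)
    q = np.clip(normalize(q_pred), eps, None)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


if __name__ == "__main__":
    # Hypothetical 2x2 maps: people only ever occupy the top row,
    # while the prediction spreads mass uniformly over all four cells.
    ground_truth = np.array([[0.5, 0.5], [0.0, 0.0]])
    prediction = np.ones((2, 2))
    print(kl_divergence(ground_truth, prediction))
```

A lower value means the predicted distribution places its mass closer to where people were actually observed; identical maps give a divergence of zero.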

Abstract

A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is important for applications such as mobility studies in urban areas or robot navigation in human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). We describe the methodology and the proposed neural architecture, and report experimental results on a standard dataset. The proposed ViT architecture improves the metrics compared to a CNN-based method.

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The semapp2 architecture.
  • Figure 2: semapp2 variation using a MAE autoencoder.
  • Figure 3: Qualitative comparison of results on the Stanford Drone Dataset. Our ViT-based model shows competitive performance compared to semapp (Rudenko et al.), demonstrating the effectiveness of Vision Transformers in predicting occupancy priors. Top left: the original semantic map, highlighting the different classes. Top right: the ground-truth distribution of occupancies. Bottom left, bottom middle, and bottom right: the predictions generated by semapp, semapp2, and MAE-semapp2, respectively.
  • Figure 4: Qualitative comparison between using 9 semantic classes (left) and 13 semantic classes (right).
  • Figure 5: Example of prediction using the MAE-based semapp2.
  • ...and 2 more figures