Unified Human Localization and Trajectory Prediction with Monocular Vision

Po-Chien Luan; Yang Gao; Celine Demonsant; Alexandre Alahi

TL;DR

This work tackles robust human localization and trajectory prediction using only monocular vision. It introduces MonoTransmotion (MT), a Transformer-based framework with two jointly trained modules for BEV localization and trajectory prediction, augmented by a directional loss to smooth sequential localization. MT demonstrates strong, real-world performance on both curated (NuScenes) and non-curated (HEADS-UP) datasets, outperforming baselines and showing resilience to noisy inputs while operating in real time. By eliminating LiDAR, MT reduces cost and hardware requirements, highlighting the pivotal role of accurate BEV localization in reliable trajectory forecasting for mobile robotics and assistive applications.

Abstract

Conventional human trajectory prediction models rely on clean, curated data that requires specialized equipment or manual labeling, which is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, which hurts their robustness on noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person from 2D human poses and is enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that jointly training both tasks within our unified framework makes our method more robust in real-world scenarios with noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around a 12% improvement over baseline models on BEV localization and trajectory prediction. On the real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.
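
The abstract does not give the form of the directional loss, so the following is only an illustrative sketch of one plausible reading and not the authors' formulation. It assumes the loss compares the direction of step-to-step displacements between predicted and ground-truth BEV positions using cosine similarity. The function name `directional_loss`, the `(x, y)` tuple layout, and the `min_step` threshold are all hypothetical.

```python
import math


def directional_loss(pred, target, min_step=1e-6):
    """Hypothetical directional loss (an assumption, not the paper's definition).

    pred, target: sequences of (x, y) BEV positions of equal length.
    Returns the mean of (1 - cosine similarity) between consecutive
    displacement vectors. Steps where either displacement is shorter
    than min_step are skipped, since their direction is undefined.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    total = 0.0
    count = 0
    for t in range(1, len(pred)):
        # Displacement between consecutive predicted positions.
        px = pred[t][0] - pred[t - 1][0]
        py = pred[t][1] - pred[t - 1][1]
        # Displacement between consecutive ground-truth positions.
        gx = target[t][0] - target[t - 1][0]
        gy = target[t][1] - target[t - 1][1]

        p_norm = math.hypot(px, py)
        g_norm = math.hypot(gx, gy)
        if p_norm < min_step or g_norm < min_step:
            continue  # direction undefined for a (near) stationary step

        cosine = (px * gx + py * gy) / (p_norm * g_norm)
        total += 1.0 - cosine
        count += 1

    # No usable steps: contribute nothing to the loss.
    return total / count if count else 0.0
```

Under this assumption, the term is 0 when the predicted positions move in the same direction as the ground truth at every step and grows up to 2 when they move in the opposite direction. In a jointly trained setup it would be added, with some weight, to a positional localization loss so that it penalizes frame-to-frame jitter in the estimated direction of travel.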
