Unified Human Localization and Trajectory Prediction with Monocular Vision

Po-Chien Luan, Yang Gao, Celine Demonsant, Alexandre Alahi

TL;DR

This work tackles robust human localization and trajectory prediction using only monocular vision. It introduces MonoTransmotion (MT), a Transformer-based framework with two jointly trained modules for BEV localization and trajectory prediction, augmented by a directional loss to smooth sequential localization. MT demonstrates strong, real-world performance on both curated (NuScenes) and non-curated (HEADS-UP) datasets, outperforming baselines and showing resilience to noisy inputs while operating in real time. By eliminating LiDAR, MT reduces cost and hardware requirements, highlighting the pivotal role of accurate BEV localization in reliable trajectory forecasting for mobile robotics and assistive applications.

Abstract

Conventional human trajectory prediction models rely on clean, curated data that requires specialized equipment or manual labeling, which is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, which reduces their robustness on noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person from 2D human poses and is trained with a novel directional loss for smoother sequential localization. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios with noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On the real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.
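
The abstract does not give the form of the directional loss. As a purely illustrative assumption, one way to encourage consistent direction across sequential BEV estimates is to penalize the angle between predicted and ground-truth frame-to-frame displacements. The function name `directional_loss`, the cosine formulation, and the `(T, 2)` input layout below are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def directional_loss(pred, gt, eps=1e-8):
    """Hypothetical directional penalty (an assumption, not the paper's definition).

    pred, gt: array-likes of shape (T, 2) holding BEV positions over T frames.
    Returns the mean of (1 - cosine similarity) between the predicted and
    ground-truth displacement vectors of consecutive frames.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Frame-to-frame displacement vectors, shape (T - 1, 2).
    d_pred = np.diff(pred, axis=0)
    d_gt = np.diff(gt, axis=0)

    # Cosine similarity per step; eps guards against zero-length steps.
    dot = np.sum(d_pred * d_gt, axis=1)
    norms = np.linalg.norm(d_pred, axis=1) * np.linalg.norm(d_gt, axis=1)
    cos = dot / (norms + eps)

    # 0 when directions agree, 1 when perpendicular, 2 when opposite.
    return float(np.mean(1.0 - cos))
```

Under this formulation, a predicted track that is offset from the ground truth but moves in the same direction incurs almost no directional penalty, so such a term would complement a position loss rather than replace it.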

Paper Structure

This paper contains 18 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of our task and conventional pipeline. (a) Conventional settings directly obtain curated localization from LiDAR or other sensors [paez20213d]. (b) Our task focuses on leveraging only a monocular camera to estimate BEV localization and using the estimates to predict trajectories. We present a unified model that jointly solves localization and prediction.
  • Figure 2: The overview of the complete pipeline. MonoTransmotion (MT) utilizes estimated 2D human poses as input to simultaneously address both BEV localization and trajectory prediction tasks.
  • Figure 3: Qualitative results on NuScenes. We visualize four frames of images at consistent intervals, presented in order from left to right. (a) MT achieves better accuracy and smoothness in trajectory prediction. (b) Although there are some deviations in localization, MT maintains the correct direction due to its sequential model and the use of directional loss. (c) MT does not always predict the exact trajectory; however, its sequential approach to localization and trajectory prediction results in a smoother overall trajectory.
  • Figure 4: Impact of different distances. We observe that monocular-based BEV localization methods have higher errors when agents are farther away. This increased error in BEV localization significantly impacts the accuracy of trajectory prediction results.
  • Figure 5: Qualitative results on HEADS-UP dataset. (a) MT offers improved direction and accuracy, even when some keypoints are missing. (b) Monocular methods lose accuracy as the target distance increases. The ground truth from the real-world pipeline contains more noise than the carefully curated NuScenes ground truth.
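
The distance effect noted in the Figure 4 and Figure 5 captions is consistent with basic pinhole-camera geometry, although the source does not state this as the cause. A minimal sketch, assuming a person of known height H imaged at pixel height h by a camera with focal length f (in pixels), so that depth is Z = f·H / h: a fixed one-pixel error in h then produces a depth error that grows roughly with Z². All function names and numeric values below are illustrative assumptions, not parameters of MT or of either dataset.

```python
def depth_from_pixel_height(focal_px, height_m, pixel_height):
    """Pinhole depth estimate Z = f * H / h (all inputs are assumed values)."""
    return focal_px * height_m / pixel_height


def depth_error_for_pixel_noise(focal_px, height_m, depth_m, noise_px=1.0):
    """Depth error caused by under-measuring the pixel height by noise_px."""
    # Pixel height the person would have at the true depth.
    true_pixel_height = focal_px * height_m / depth_m
    if true_pixel_height <= noise_px:
        raise ValueError("noise_px must be smaller than the projected height")

    # Re-estimate depth from the perturbed measurement and compare.
    noisy_depth = depth_from_pixel_height(
        focal_px, height_m, true_pixel_height - noise_px
    )
    return noisy_depth - depth_m


if __name__ == "__main__":
    # Illustrative camera (f = 1000 px) and person height (1.7 m).
    for depth in (5.0, 10.0, 20.0, 40.0):
        err = depth_error_for_pixel_noise(1000.0, 1.7, depth)
        print(f"true depth {depth:5.1f} m -> error from 1 px noise: {err:.3f} m")
```

Because the same pixel-level noise maps to a much larger metric error for far agents, any predictor consuming those positions starts from a noisier observed trajectory, which matches the caption's remark that localization error propagates into trajectory prediction.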