MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation

Yuxiang Fu, Qi Yan, Lele Wang, Ke Li, Renjie Liao

TL;DR

MoFlow addresses multi-modal human trajectory forecasting by modeling $K$ correlated future paths with a conditional flow matching framework. A data-space reformulation of flow matching, combined with a multi-modal objective, yields diverse, accurate predictions, while an IMLE-based distillation enables one-step sampling without sacrificing quality. The teacher model achieves state-of-the-art results on NBA, ETH-UCY, and SDD, and the IMLE student delivers comparable accuracy at about 100× faster sampling, making deployment practical for time-critical settings. This work advances probabilistic trajectory forecasting by unifying flow-based generation with efficient, principled distillation, offering both accuracy and runtime benefits for real-world scenarios.

Abstract

In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel motion prediction conditional flow matching model, termed MoFlow, to predict $K$-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the $K$ sets of future trajectories is accurate but also encourages all $K$ sets of future trajectories to be diverse and plausible. Furthermore, by leveraging implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is 100 times faster than the teacher flow model during sampling. The code, model, and data are available at our project page: https://moflow-imle.github.io
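The IMLE-based distillation described in the abstract matches each teacher sample to its nearest student sample and minimizes that distance. The following is a minimal sketch of that nearest-sample idea only, not the authors' implementation: the Euclidean distance, the number of student samples, and the names `imle_loss` and `toy_student` are assumptions made for illustration, and the toy generator stands in for a real one-step network.

```python
import math
import random

def l2_distance(a, b):
    """Euclidean distance between two flattened trajectories of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def imle_loss(teacher_sample, student_samples):
    """Return (distance, index) of the student sample nearest to the teacher sample.

    Only the nearest student sample contributes to the loss, so in a real
    training loop only that sample would receive a gradient.
    """
    distances = [l2_distance(teacher_sample, s) for s in student_samples]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return distances[nearest], nearest

def toy_student(latent, scale=1.0):
    """Hypothetical one-step generator: maps a latent vector to a trajectory."""
    return [scale * z for z in latent]

random.seed(0)
# Stand-in for one teacher output obtained by solving the ODE (assumed values).
teacher_sample = [0.5, -0.2, 0.1, 0.9]
# Draw several latents and push each through the student in a single step.
latents = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
student_samples = [toy_student(z) for z in latents]

loss, idx = imle_loss(teacher_sample, student_samples)
print(f"nearest student sample: {idx}, loss: {loss:.3f}")
```

Because the loss is computed from teacher samples alone, this style of objective needs no access to the teacher's internals, which is consistent with the abstract's statement that the method only requires samples from the teacher model.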

Paper Structure

This paper contains 35 sections, 13 equations, 11 figures, 9 tables, 3 algorithms.

Figures (11)

  • Figure 1: Overview of our MoFlow model, which consists of a social temporal encoder for contextual cues and a motion decoder that predicts $K$-shot future trajectories of all agents in a scene.
  • Figure 2: The overall scheme of our proposed MoFlow model and the IMLE distillation framework. Top: Solid lines denote past trajectories; dotted lines show multi-modal future predictions. The teacher MoFlow model predicts by solving the denoising ODE, with the green paths being ODE solutions mapping samples from noise to data. Bottom: The IMLE objective trains a student model for one-step inference by minimizing the distance between a teacher model sample and its closest counterpart from the student model, as indicated by the arrows.
  • Figure 3: Qualitative results on the NBA dataset. (a) We compare the best-of-20 predictions from our MoFlow IMLE distillation method, the best-of-20 predictions from the LED method, and the ground truth future trajectories. The visualization demonstrates that our approach produces predictions that align more closely with the ground truth trajectories than those of the LED model. (b) Using the same two scenes as in (a), this figure illustrates the diversity of samples from our IMLE generator. Our method generates a prediction that conforms to the ground truth trajectory. (Pink: the sample closest to the ground truth in the $L_2$ sense among $K=20$ predictions.)
  • Figure 4: The qualitative results on the ETH-UCY dataset show that our MoFlow IMLE distillation model’s best-of-20 predictions (selected via the lowest FDE) closely match the ground truth future trajectories, capturing important motion nuances.
  • Figure 5: Qualitative results on the NBA dataset in terms of diversity. Our method generates diverse samples that are more socially plausible. Some of the trajectories generated by the LED model, highlighted by red circles, do not adhere to basketball game patterns or rules. (Light color indicates past trajectory while dark color indicates future trajectory; blue/orange/green: the two teams and the basketball; pink: the sample closest to the ground truth in the $L_2$ sense among $K=20$ predictions.)
  • ...and 6 more figures
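
The best-of-20 selection mentioned in the figure captions (picking, among $K$ predictions, the one with the lowest FDE) can be sketched as follows. This is an illustrative sketch using the standard definitions of average and final displacement error for 2D trajectories; the function names and the toy data are assumptions, and a small $K=3$ is used for readability.

```python
import math

def point_dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ade(pred, gt):
    """Average displacement error: mean point-wise distance over all time steps."""
    return sum(point_dist(p, q) for p, q in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: distance between the last predicted and true points."""
    return point_dist(pred[-1], gt[-1])

def best_of_k(preds, gt, metric=fde):
    """Return (index, score) of the prediction with the lowest metric value."""
    scores = [metric(p, gt) for p in preds]
    best = min(range(len(preds)), key=scores.__getitem__)
    return best, scores[best]

# Toy ground truth: an agent walking along the x-axis (assumed data).
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts upward
    [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)],  # small lateral offset
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # stays in place
]

idx, score = best_of_k(preds, gt)
print(idx, score)  # 1 0.5
```

Passing `metric=ade` instead selects the prediction closest to the ground truth on average over the whole horizon rather than at the final step.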