Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

Jaewoo Jeong, Seohee Lee, Daehee Park, Giwon Lee, Kuk-Jin Yoon

TL;DR

This work addresses pedestrian trajectory forecasting by leveraging multi-modal information (trajectory, 3D pose, and text) while mitigating computational costs through knowledge distillation. A teacher-student framework is proposed in which a full-modality teacher guides a student that operates on limited modalities, with explicit intra-agent and inter-agent latent alignments. The approach yields consistent performance gains across ego-view and BEV-view datasets, with improvements of up to approximately 13% in forecasting metrics, and text-driven cues play a key role in bridging modality gaps. The framework generalizes across models (HiVT and MART) and datasets (JRDB, SIT, ETH/UCY), and offers practical benefits for resource-constrained systems without sacrificing predictive accuracy.

Abstract

Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires a vision-language model (VLM), which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model that uses only trajectory, or trajectory with human pose as the sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving by up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.

Paper Structure

This paper contains 26 sections, 9 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Multi-modal data such as human pose and text greatly improve trajectory forecasting performance. However, expensive modalities such as text are not readily available during application. Thus, we transfer the extensive knowledge from full modalities to a student model operating on a limited set of modalities.
  • Figure 2: We first pre-train a teacher model that leverages the full range of modalities, upon which a student model with limited modalities ($\mathcal{X}+\mathcal{P}$ or $\mathcal{X}$) is distilled from scratch. Regression losses for three observation settings ($T_p,2,1$) are applied to both teacher and student, while additional KD losses guide the student to robustly encode intra-agent modalities ($Q$) and inter-agent interactions ($H$).
  • Figure 3: Qualitative results on JRDB with the $\mathcal{X}+\mathcal{P}$ HiVT model. The model trained with KD outperforms its baseline counterpart. Improved accuracy is demonstrated on both instantaneous and full observations. The origin denotes the robot position, and bubbles represent text annotations.
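
The student objective described in the Figure 2 caption (regression losses for the three observation settings, plus KD terms on the intra-agent latent $Q$ and the inter-agent latent $H$) can be sketched as below. This is a minimal illustration, not the authors' implementation: the use of mean squared error for latent alignment, the weights `w_intra` and `w_inter`, and all function and key names are assumptions made for this example.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(
    reg_losses: Dict[str, float],
    q_student: Vector,
    q_teacher: Vector,
    h_student: Vector,
    h_teacher: Vector,
    w_intra: float = 1.0,  # hypothetical weight for the Q (intra-agent) term
    w_inter: float = 1.0,  # hypothetical weight for the H (inter-agent) term
) -> float:
    """Combine regression and KD terms into one scalar student loss.

    reg_losses holds one regression loss per observation setting
    (full T_p frames, 2 frames, 1 frame). The KD terms pull the student
    latents toward the teacher latents; MSE is an assumed distance here.
    """
    regression = sum(reg_losses.values())
    intra_kd = mse(q_student, q_teacher)
    inter_kd = mse(h_student, h_teacher)
    return regression + w_intra * intra_kd + w_inter * inter_kd


# Toy values, chosen only to make the arithmetic easy to follow.
example = student_loss(
    reg_losses={"full": 1.0, "two_frames": 0.5, "one_frame": 0.25},
    q_student=[0.0, 0.0],
    q_teacher=[1.0, 1.0],
    h_student=[2.0],
    h_teacher=[0.0],
)
```

With these toy inputs the regression terms sum to 1.75, the intra-agent term is 1.0, and the inter-agent term is 4.0. In an actual training loop the teacher latents would be treated as fixed targets, so that only the student is updated by the KD terms.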