MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting

Mohammed Amine Bencheikh Lehocine, Julian Schmidt, Frank Moosmann, Dikshant Gupta, Fabian Flohr

TL;DR

This work proposes MASAR, a novel, fully differentiable framework for joint 3D detection and trajectory forecasting that is compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features.

Abstract

Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of "looking backward to look forward", and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.
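
The abstract reports its gains in minADE and minFDE. As a reminder of what these metrics measure, here is a minimal sketch using their standard definitions: the lowest average (ADE) or final (FDE) Euclidean displacement over the predicted modes. The function names and the toy trajectories are illustrative assumptions; the paper's exact evaluation protocol (number of modes, forecast horizon, detection matching) is not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def displacement(a: Point, b: Point) -> float:
    """Euclidean distance between two 2D positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def min_ade(modes: Sequence[Trajectory], gt: Trajectory) -> float:
    """Lowest mean per-timestep displacement over all predicted modes."""
    return min(
        sum(displacement(p, g) for p, g in zip(mode, gt)) / len(gt)
        for mode in modes
    )


def min_fde(modes: Sequence[Trajectory], gt: Trajectory) -> float:
    """Lowest final-timestep displacement over all predicted modes."""
    return min(displacement(mode[-1], gt[-1]) for mode in modes)


# Toy data (hypothetical): one ground-truth future and two predicted modes.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 0.0)],  # off early, exact at the end
    [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)],  # exact early, off at the end
]

ade = min_ade(modes, gt)  # best average error comes from the second mode
fde = min_fde(modes, gt)  # best endpoint error comes from the first mode
```

In this toy case the two metrics are minimized by different modes, which is why both are usually reported together.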

Paper Structure (25 sections, 7 equations, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: Core idea of our method. Left: multi-frame input images. Right: (A) For each object query (i.e., hypothesis), taking the black car as an example, we iteratively refine past trajectory hypotheses (yellow, orange, red), aggregate visual features along them, and perform appearance-guided scoring. $\mathbf{H_2}$ wins because it hits more visual features corresponding to the object. (B) Based on the selected past trajectory and the aggregated features, multiple modes of future trajectories (blue) are predicted.
  • Figure 2: MASAR architecture: The scene encoder encodes all multi-frame, multi-view images into BEVs. Detector decoder: $L_d$ layers iteratively refine object detections together with their past trajectory estimates. Forecasting decoder: $L_f$ layers iteratively forecast multiple trajectory modes for each detected object based on its estimated past trajectory and its visual features along that trajectory.
  • Figure 3: Visualization of some nuScenes validation samples. Green: ego, purple: detections, red: ground truth. Only future trajectories are plotted; past trajectories are shown in magnified views. The rendered maps are for visualization only and are not input to the model. (a) and (c) show challenging crowded scenes, (b) and (d) show diverse multi-modal futures, and (e) and (f) show typical failure cases caused by missing context.
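
The appearance-guided scoring described in the Figure 1 caption can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: BEV features are stored as small dictionaries keyed by grid cell, each past-trajectory hypothesis is a sequence of cells (one per past frame), and a hypothesis is scored by the mean cosine similarity between a hypothetical object query embedding and the features it hits.

```python
import math
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
Feature = Sequence[float]
BevFrame = Dict[Cell, Feature]  # hypothetical sparse BEV feature grid


def cosine(a: Feature, b: Feature) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def score_hypothesis(query: Feature, frames: Sequence[BevFrame],
                     trajectory: Sequence[Cell]) -> float:
    """Mean similarity between the query and features along one hypothesis."""
    sims = []
    for frame, cell in zip(frames, trajectory):
        feat = frame.get(cell)  # cells without features contribute 0
        sims.append(cosine(query, feat) if feat is not None else 0.0)
    return sum(sims) / len(sims)


def select_hypothesis(query: Feature, frames: Sequence[BevFrame],
                      hypotheses: Sequence[Sequence[Cell]]) -> Tuple[int, List[float]]:
    """Return the index of the best-scoring hypothesis and all scores."""
    scores = [score_hypothesis(query, frames, h) for h in hypotheses]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy scene (hypothetical): the object moves along row 0; row 1 is background.
query = (1.0, 0.0)
frames = [{(t, 0): (1.0, 0.0), (t, 1): (0.0, 1.0)} for t in range(3)]
hypotheses = [
    [(0, 1), (1, 1), (2, 1)],  # H1: follows background only
    [(0, 0), (1, 0), (2, 0)],  # H2: follows the object in every frame
    [(0, 0), (1, 1), (2, 1)],  # H3: drifts off the object after frame 0
]

best, scores = select_hypothesis(query, frames, hypotheses)
```

Here the second hypothesis scores highest because it hits object features in every frame, mirroring the caption's explanation of why $\mathbf{H_2}$ wins. A learned model would replace the fixed cosine score with trained attention or scoring layers.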