TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes

Liangyu Xu; Wanxuan Lu; Hongfeng Yu; Yongqiang Mao; Hanbo Bi; Chenglong Liu; Xian Sun; Kun Fu

TL;DR

This paper introduces target-aware aerial video prediction, a task that simultaneously predicts future scenes and the motion states of the target. It also designs an information-sharing mechanism (ISM) that unifies the modeling of video and target motion by exchanging information through two sets of messenger tokens.

Abstract

As drone technology advances, using unmanned aerial vehicles for aerial surveys has become the dominant trend in modern low-altitude remote sensing. The surge in aerial video data calls for accurate prediction of future scenes and of the motion states of the target of interest, particularly in applications such as traffic management and disaster response. Existing video prediction methods focus solely on predicting future scenes (video frames) and do not explicitly model the target's motion states, which are crucial for aerial video interpretation. To address this issue, we introduce a novel task called Target-Aware Aerial Video Prediction, which aims to simultaneously predict future scenes and the motion states of the target. We further design a model specifically for this task, named TAFormer, which provides a unified modeling approach for both video and target motion states. Specifically, we introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, modeling scene appearance and motion respectively. Additionally, we design an Information Sharing Mechanism (ISM), which unifies the modeling of video and target motion by facilitating information interaction through two sets of messenger tokens. Moreover, to alleviate the difficulty of distinguishing targets in blurry predictions, we introduce a Target-Sensitive Gaussian Loss (TSGL), which enhances the model's sensitivity to both the target's position and its content. Extensive experiments on UAV123VP and VisDroneVP (derived from single-object tracking datasets) demonstrate the strong performance of TAFormer in target-aware video prediction and its adaptability to the additional target-awareness requirements of aerial video interpretation.

Paper Structure (29 sections, 23 equations, 9 figures, 8 tables)

Introduction
Related Work
Video Prediction
Motion Prediction
BEV-based Future Prediction for Autonomous Driving
Method
Problem Definition
Overview
Spatiotemporal Attention
Information Sharing Mechanism through Messengers
Messenger Initialization
Message Collecting
Message Passing
Target-Sensitive Gaussian Loss
Experiments
...and 14 more sections

Figures (9)

Figure 1: Comparison between existing tasks and ours. (a) Video prediction methods focus on predicting future frames using historical video frames. (c) Motion prediction methods utilize a static map as an additional input, predicting future motion states based on the historical movement patterns of the target. (b) The proposed Target-Aware Aerial Video Prediction task, which not only predicts the overall evolution of the environment but also focuses on the motion state of the target, achieving a more integrated spatiotemporal prediction.
Figure 2: The overall framework of TAFormer. Given a sequence of video frames $\mathcal{X}_{t,T}=\{\boldsymbol{x}_{i}\}_{t-T+1}^{t}$ and the corresponding bounding box sequence $\mathcal{B}_{t,T}=\{\boldsymbol{b}_{i}\}_{t-T+1}^{t}$ for the target of interest, spanning the past $T$ frames up to time $t$, TAFormer forecasts the succeeding $T'$ video frames $\mathcal{Y}_{t+1, T'}=\{\boldsymbol{x}_{i}\}_{t+1}^{t+T'}$ and the target's bounding boxes $\mathcal{C}_{t+1, T'}=\{\boldsymbol{b}_{i}\}_{t+1}^{t+T'}$, starting from time $t + 1$. FE(S) and FE(B) denote spatial feature embedding and bounding box feature embedding, respectively.
Figure 3: Details of the Spatiotemporal Attention. It consists of spatial attention and temporal attention, and the final attention is their product.
Figure 4: Details of the ISM. Different background colors represent distinct processing steps. The left half illustrates the Messengers Initialization process, while the upper and lower sections on the right depict the Message Collecting and Message Passing processes, respectively.
Figure 5: Target-sensitive Gaussian loss. By applying Gaussian weighting to the predicted frames and the ground truth, we highlight the regions relevant to the target of interest, achieving content-awareness and position-awareness for the target.
...and 4 more figures
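
The Gaussian weighting described in the Figure 5 caption can be illustrated with a small NumPy sketch. This is not the paper's formulation of TSGL, which is not given on this page. It is a minimal example under stated assumptions: boxes are `(cx, cy, w, h)` in pixels, the Gaussian's spread is tied to the box size through a hypothetical `sigma_scale` parameter, the predicted frame is weighted by a map centered on the predicted box and the ground truth by a map centered on the true box, and the two weighted frames are compared with a mean squared error. The function names are also hypothetical.

```python
import numpy as np


def gaussian_weight_map(height, width, box, sigma_scale=0.5):
    """Return an (height, width) Gaussian map that peaks at 1 on the box center.

    box is (cx, cy, w, h) in pixels. sigma_scale is an illustrative
    assumption that ties the Gaussian spread to the box size.
    """
    cx, cy, w, h = box
    ys, xs = np.mgrid[0:height, 0:width]
    sigma_x = max(sigma_scale * w, 1e-6)
    sigma_y = max(sigma_scale * h, 1e-6)
    exponent = ((xs - cx) / sigma_x) ** 2 + ((ys - cy) / sigma_y) ** 2
    return np.exp(-0.5 * exponent)


def target_weighted_mse(pred, truth, pred_box, true_box, sigma_scale=0.5):
    """Mean squared error between Gaussian-weighted frames.

    Assumption: the predicted frame is weighted around the predicted box
    and the ground-truth frame around the true box, so an error in either
    the target's content or its position raises the loss.
    """
    height, width = truth.shape[:2]
    w_pred = gaussian_weight_map(height, width, pred_box, sigma_scale)
    w_true = gaussian_weight_map(height, width, true_box, sigma_scale)
    if truth.ndim == 3:
        # Broadcast the 2D weight maps over the channel axis.
        w_pred = w_pred[..., None]
        w_true = w_true[..., None]
    return float(np.mean((w_pred * pred - w_true * truth) ** 2))
```

Under these assumptions, a pixel error at the target center contributes with weight close to 1, while the same error far from the target is almost entirely suppressed. A misplaced predicted box also yields a nonzero loss even when the two frames are identical.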
