Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction
Hy Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
TL;DR
This work tackles semantic dilution in transformer-based video frame prediction by introducing Semantic Concentration Multi-Head Self-Attention (SCMHSA) and an embedding-space loss that aligns training with the model's latent outputs. The proposed SC-VFP architecture embeds each frame using the ViT [CLS] token, processes the embeddings through SCMHSA-enabled encoder blocks, and predicts the next frame as an embedding via an MLP; losses are defined in embedding space rather than on pixel reconstructions. Empirically, SC-VFP delivers strong results on UCSD Pedestrian, UCF Sports, and Penn Action, with notable gains on larger datasets, but shows limited improvement on the small KTH dataset, which points to scalability benefits and sensitivity to dataset size. The approach offers a pathway to more semantically faithful predictions, with potential applications in anomaly detection and tracking, where embedding-level representations are critical.
Abstract
Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its prowess in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face two notable issues: (a) the multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads. Each chunk captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space and results in a semantic dilution problem; (b) these models predict the embeddings of the next frames rather than the frames themselves, but the loss function is based on the errors of the reconstructed frames, not of the predicted embeddings, which creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.
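
To make the two issues concrete, the following is a minimal sketch in plain Python, not the SCMHSA implementation. It shows how standard MHSA partitions one embedding into $N$ per-head chunks, so each head sees only a fraction of the dimensions, and how a loss can be computed directly between a predicted and a target embedding. The function names `split_heads` and `embedding_mse` are hypothetical, and mean squared error is an assumed choice for illustration; the abstract does not specify the form of the latent-space loss.

```python
def split_heads(embedding, num_heads):
    """Split one embedding vector into num_heads equal, contiguous chunks,
    as standard multi-head self-attention does before per-head attention."""
    dim = len(embedding)
    if dim % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    return [embedding[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


def embedding_mse(predicted, target):
    """Mean squared error between two embedding vectors.
    Assumption: MSE stands in for the latent-space loss, whose exact
    form is not given in the abstract."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Toy 8-dimensional frame embedding split across N = 4 heads.
frame_embedding = [float(i) for i in range(8)]
chunks = split_heads(frame_embedding, num_heads=4)

# Each head receives only 2 of the 8 dimensions of the original embedding.
fraction_seen_per_head = len(chunks[0]) / len(frame_embedding)

# Loss taken on embeddings directly, with no frame reconstruction step.
loss = embedding_mse([1.0, 2.0], [1.0, 4.0])
```

In this toy setting each head operates on a quarter of the embedding, which is the partitioning that issue (a) refers to, and the loss is evaluated on the same kind of object the model outputs, which is the alignment that issue (b) calls for.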
