Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
TL;DR
The paper tackles semantic future prediction for autonomous systems by proposing FUTURIST, a unified multimodal visual sequence transformer trained with a multimodal masked visual modeling objective. A VAE-free hierarchical tokenization replaces VAE-based tokenizers, while cross-modality fusion and new masking strategies let the model exploit complementary modalities. Pseudo-labels from off-the-shelf models enable end-to-end training without manual annotations. On Cityscapes, the approach achieves state-of-the-art results for short- and mid-term future semantic segmentation and depth prediction. The work demonstrates the efficacy of multimodal semantic forecasting and offers efficient techniques suited to autonomous systems operating in dynamic environments.
Abstract
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. The project page and code are available at https://futurist-cvpr2025.github.io/.
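To make the idea of masking across modalities concrete, the following is a minimal sketch, not the paper's actual masking mechanism. It assumes two modalities (segmentation and depth), fully visible context frames, and future-frame masks drawn independently per modality, so a token hidden in one modality may remain visible in the other. The function name `sample_multimodal_mask` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random


def sample_multimodal_mask(num_frames, num_context, tokens_per_frame,
                           modalities, mask_ratio, seed=None):
    """Return mask[modality][frame][token]; True means the token is hidden.

    Illustrative assumption: context frames stay fully visible, and each
    modality draws its own random mask for every future frame.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * tokens_per_frame)
    mask = {}
    for modality in modalities:
        frames = []
        for t in range(num_frames):
            row = [False] * tokens_per_frame
            if t >= num_context:
                # Hide a random subset of this modality's future tokens.
                for i in rng.sample(range(tokens_per_frame), num_masked):
                    row[i] = True
            frames.append(row)
        mask[modality] = frames
    return mask


mask = sample_multimodal_mask(
    num_frames=3, num_context=2, tokens_per_frame=8,
    modalities=["segmentation", "depth"], mask_ratio=0.5, seed=0,
)
```

Under these assumptions, a model trained to reconstruct the hidden tokens can draw on tokens of the other modality that remain visible at the same position, which is the kind of cross-modal integration the abstract describes.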
