Table of Contents
Fetching ...

Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

TL;DR

The paper tackles semantic future prediction for autonomous systems by proposing FUTURIST, a unified multimodal visual sequence transformer trained with a multimodal masked visual modeling objective. It eliminates the need for VAE-based tokenizers through a VAE-free, hierarchical tokenization and introduces cross-modality fusion and novel masking strategies to leverage complementary modalities. Pseudo-labels from off-the-shelf models enable end-to-end training without manual annotations, and the approach achieves state-of-the-art results on Cityscapes for short- and mid-term future semantic segmentation and depth. The work demonstrates the efficacy of multimodal semantic forecasting and offers scalable, efficient techniques for real-world deployment in dynamic environments.

Abstract

Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. Project page and code at https://futurist-cvpr2025.github.io/ .

Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

TL;DR

The paper tackles semantic future prediction for autonomous systems by proposing FUTURIST, a unified multimodal visual sequence transformer trained with a multimodal masked visual modeling objective. It eliminates the need for VAE-based tokenizers through a VAE-free, hierarchical tokenization and introduces cross-modality fusion and novel masking strategies to leverage complementary modalities. Pseudo-labels from off-the-shelf models enable end-to-end training without manual annotations, and the approach achieves state-of-the-art results on Cityscapes for short- and mid-term future semantic segmentation and depth. The work demonstrates the efficacy of multimodal semantic forecasting and offers scalable, efficient techniques for real-world deployment in dynamic environments.

Abstract

Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. Project page and code at https://futurist-cvpr2025.github.io/ .
Paper Structure (27 sections, 2 equations, 11 figures, 8 tables)

This paper contains 27 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: End-to-End Trainable Multimodal Visual Sequence Transformer. Our framework employs modality-specific embedders to map inputs into tokens. Then, these token embeddings are fused through token-wise concatenation to form the combined input embedding $\mathbf{Z}$. Next, a masked transformer processes $\mathbf{Z}$, capturing spatiotemporal dependencies between modalities, and outputs $\tilde{\mathbf{Z}}$. Finally, modality-specific decoders produce future frame predictions for each modality based on $\tilde{\mathbf{Z}}$, enabling efficient and accurate multimodal semantic future prediction within a unified architecture. Tokens are shown in 2D (not flattened) for visualization purposes.
  • Figure 2: Modality Embedder. Our method embeds per-pixel values and aggregates them into patch tokens.
  • Figure 3: Masking strategies. We explored four strategies for masking future-frame tokens across modalities: Fully Masked (all tokens masked), Fully Independent (independently masked tokens per modality), Fully Shared (shared mask across modalities), and Partially Shared + Exclusive (shared mask for some tokens, with others masked exclusively per modality).
  • Figure 4: Scaling Training Epochs. Model performance for segmentation forecasting (all classes and movable objects) and depth forecasting, as a function of the number of training epochs. Results for (a) short-term prediction and (b) mid-term prediction.
  • Figure 5: Qualitative comparison of future semantic segmentation and depth prediction. Oracle results are derived from Segmenter strudel2021segmenter (segmentation) and DepthAnythingV2 yang2024depth (depth). We compare against the state-of-the-art method VISTA gao2024vista, which generates future RGB frames via video diffusion, followed by Segmenter and DepthAnythingV2 for predictions.
  • ...and 6 more figures