DINO-Foresight: Looking into the Future with DINO

Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

TL;DR

DINO-Foresight reframes future-frame prediction by forecasting semantically rich Vision Foundation Model (VFM) features rather than raw RGB frames. A self-supervised masked transformer predicts the temporal evolution of multi-layer VFM features, and off-the-shelf task heads then consume the predicted features for semantic segmentation, instance segmentation, depth, and surface normals. The approach uses hierarchical, PCA-reduced target features and compute-efficient high-resolution training strategies, achieving strong multi-task performance with a single model on Cityscapes and nuScenes. Because the forecaster is decoupled from the task heads, the framework is modular, which the authors present as a scalability and generalization advantage for autonomous driving and robotics applications.

Abstract

Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code at https://dino-foresight.github.io/.
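
The training objective described above (predict masked future-frame features from context frames, scored with a SmoothL1 loss as stated in the Figure 1 caption) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the array shapes, the `beta` threshold, the assumption that every token of the future frame is masked, and the helper names `masked_forecast_loss` and `copy_last_frame` are all hypothetical, and the trivial copy-last-frame predictor merely stands in for the masked transformer.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Elementwise Smooth L1 loss, averaged over all elements.

    Quadratic for |diff| < beta, linear otherwise (beta is an assumed default).
    """
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def masked_forecast_loss(features, predict_fn, num_context, beta=1.0):
    """Score a predictor on the frames that follow the context frames.

    features: (T, N, C) array of per-frame token features (T frames,
    N tokens, C channels). The last T - num_context frames are hidden
    from predict_fn, which must produce them from the context alone.
    """
    context = features[:num_context]
    target = features[num_context:]
    pred = predict_fn(context, num_future=target.shape[0])
    return smooth_l1(pred, target, beta)

def copy_last_frame(context, num_future):
    """Hypothetical stand-in predictor: repeat the last context frame."""
    return np.repeat(context[-1:], num_future, axis=0)

# Toy data: 5 frames, 16 tokens per frame, 8 channels (illustrative sizes only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16, 8))
loss = masked_forecast_loss(feats, copy_last_frame, num_context=4)
print(f"SmoothL1 loss of the copy-last-frame baseline: {loss:.4f}")
```

In a real setup `predict_fn` would be the trained feature transformer, and the same predicted features would be passed to each task head without retraining the forecaster.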

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Forecasting VFM Features for Future Frames. At the core of our approach is the prediction of future VFM feature evolution. To this end, we train a masked transformer model in a self-supervised manner to forecast these features from context frames, minimizing SmoothL1 loss between predicted and actual future features. By forecasting these rich and versatile features, task-specific prediction heads—such as semantic segmentation, depth, and surface normals—can be effortlessly employed at test time, enabling modular and efficient multi-task scene understanding.
  • Figure 2: Hierarchical Target Feature Construction for the Feature Prediction Model. Our framework constructs a feature space by extracting and concatenating multi-layer features from a frozen ViT encoder, capturing semantic information at varying abstraction levels. PCA is applied to reduce dimensionality, creating compact features.
  • Figure 3: Future predictions for semantic segmentation, depth, and surface normals. Noisy segmentations at the bottom of the image (in both predicted and Oracle results) are due to unannotated regions in Cityscapes that are ignored during DPT training. This artifact affects only segmentation, not the predicted features, as evident in the clear depth and surface normal predictions.
  • Figure 4: Impact of Intermediate Transformer Features on Future Segmentation and Depth Prediction. Results are shown for semantic segmentation and depth prediction heads using two feature sets: only the VFM features predicted by the masked feature transformer (dashed line) and combined features from both predicted and intermediate transformer layers (blue bars). We evaluate DPT heads trained on features from the 6th, 9th, 10th, 11th, and 12th layers. For segmentation (barplots (a) and (b)), we report mIoU across all classes. For depth (barplots (c) and (d)), we show the reduction in AbsRel metric (higher is better) when adding intermediate layer features.
  • Figure 5: Visualization of future predictions for semantic segmentation, depth, and surface normals. The illustrated scene is Frankfurt (01 (017082-017111)).
  • ...and 3 more figures
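
The hierarchical target construction summarized in the Figure 2 caption (concatenate features from several layers of a frozen ViT encoder, then reduce them with PCA) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: random arrays stand in for real encoder features, and the number of layers, token count, channel widths, and the number of retained components `k` are all hypothetical. PCA is computed here via an SVD of the centred data, which is one standard way to obtain it.

```python
import numpy as np

def fit_pca(x, k):
    """Fit PCA on x of shape (num_tokens, dim).

    Returns (mean, components), where components has shape (k, dim) and
    its rows are the top-k principal directions.
    """
    mean = x.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]

def build_targets(layer_feats, mean, components):
    """Concatenate per-layer features along channels, then project with PCA.

    layer_feats: list of (num_tokens, dim_l) arrays, one per selected layer.
    """
    stacked = np.concatenate(layer_feats, axis=-1)
    return (stacked - mean) @ components.T

# Toy stand-ins for features from 4 encoder layers: 64 tokens, 12 channels each.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 12)) for _ in range(4)]
stacked = np.concatenate(layers, axis=-1)      # shape (64, 48)

mean, comps = fit_pca(stacked, k=10)           # fit once, reuse afterwards
targets = build_targets(layers, mean, comps)   # shape (64, 10)
print(stacked.shape, "->", targets.shape)
```

Fitting the projection once and reusing `mean` and `comps` keeps the target space fixed, so the forecaster and any downstream head see features in the same compact basis.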