DINO-Foresight: Looking into the Future with DINO
Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
TL;DR
DINO-Foresight reframes future-frame prediction by forecasting semantically rich Vision Foundation Model (VFM) features rather than raw RGB frames. A self-supervised masked transformer predicts the temporal evolution of multi-layer VFM features, which off-the-shelf task heads then consume for semantic segmentation, instance segmentation, depth estimation, and surface normal estimation. The approach uses hierarchical, PCA-reduced target features and compute-efficient high-resolution training strategies, and a single model achieves strong multi-task performance on Cityscapes and nuScenes. Because forecasting is decoupled from the task heads, the framework is modular and semantically grounded, which makes it a scalable candidate for autonomous driving and robotics applications.
Abstract
Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code are available at https://dino-foresight.github.io/.
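The data flow described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`concat_layers`, `mask_future`, `extrapolate`, `l1_loss`, `threshold_head`), the toy feature values, and the L1 objective are all assumptions made for illustration. In particular, a simple linear extrapolation stands in for the trained masked feature transformer, and a threshold rule stands in for a frozen task head.

```python
from typing import List, Optional

Token = List[float]   # feature vector of one spatial token
Frame = List[Token]   # one frame = list of tokens

MASK: Optional[Token] = None  # placeholder for a token that must be predicted


def concat_layers(layers: List[Frame]) -> Frame:
    """Concatenate per-token features from several (hypothetical) VFM layers."""
    return [sum((layer[t] for layer in layers), []) for t in range(len(layers[0]))]


def mask_future(frames: List[Frame]) -> list:
    """Keep the context frames and replace every token of the last frame with MASK."""
    return frames[:-1] + [[MASK] * len(frames[-1])]


def extrapolate(masked: list) -> Frame:
    """Stand-in predictor: linear extrapolation from the last two context frames.
    The paper trains a masked feature transformer for this step instead."""
    prev, last = masked[-3], masked[-2]
    return [[2 * b - a for a, b in zip(tp, tl)] for tp, tl in zip(prev, last)]


def l1_loss(pred: Frame, target: Frame) -> float:
    """Mean absolute error over the masked (future) tokens; an assumed objective."""
    total = sum(abs(p - q) for tp, tq in zip(pred, target) for p, q in zip(tp, tq))
    return total / (len(pred) * len(pred[0]))


def threshold_head(features: Frame) -> List[int]:
    """Stand-in for a frozen task head: one label per token."""
    return [int(tok[0] > 0.5) for tok in features]


def make_frame(t: int) -> Frame:
    """Toy features for frame t: two tokens, two 'layers' with one channel each."""
    layer_a = [[0.25 * t], [0.5 * t]]
    layer_b = [[1.0 - 0.25 * t], [0.5]]
    return concat_layers([layer_a, layer_b])


frames = [make_frame(t) for t in range(4)]   # three context frames + one future frame
masked = mask_future(frames)                 # future frame hidden from the predictor
pred = extrapolate(masked)                   # predicted future features
loss = l1_loss(pred, frames[-1])             # supervised by the true future features
labels = threshold_head(pred)                # task head applied to predicted features
```

The point of the sketch is the separation of concerns: the predictor only ever sees and produces features, so any head that works on features of the current frame can be applied unchanged to the forecast.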
