ForecastOcc: Vision-based Semantic Occupancy Forecasting
Riya Mohan, Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada
TL;DR
ForecastOcc predicts future 3D semantic occupancy directly from camera images, without relying on externally estimated occupancy maps. It introduces a forecasting module that uses temporal cross-attention to synthesize horizon-aware image representations, which are then lifted into 3D via a view transformer and refined by a 3D occupancy encoder and a semantic head. The approach is validated on Occ3D-nuScenes (multi-view) and SemanticKITTI (monocular), with extensive ablations showing the benefits of the Future State Alignment loss, contextual embeddings, and multi-view training. ForecastOcc consistently improves over the baselines across forecast horizons in both settings, capturing the scene dynamics and semantics needed for autonomous driving.
Abstract
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
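To make the idea of a temporal cross-attention forecasting module more concrete, the following is a minimal NumPy sketch, not the ForecastOcc implementation. It assumes one hypothetical learned query embedding per forecast horizon (`horizon_queries`) that attends over the features of the past frames at each spatial location. The function name, tensor shapes, projection matrices, and the choice to share queries across spatial tokens are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_cross_attention(past_feats, horizon_queries, w_q, w_k, w_v):
    """Illustrative temporal cross-attention (assumed design, not the paper's).

    past_feats:      (T, N, C) image features for T past frames, N tokens each
    horizon_queries: (H, C)    one hypothetical embedding per forecast horizon
    w_q, w_k, w_v:   (C, D)    projection matrices

    Returns horizon-aware features of shape (H, N, D) and the attention
    weights over past frames of shape (H, N, T).
    """
    q = horizon_queries @ w_q  # (H, D)
    k = past_feats @ w_k       # (T, N, D)
    v = past_feats @ w_v       # (T, N, D)

    # For each horizon and spatial token, score every past frame.
    scores = np.einsum("hd,tnd->hnt", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # normalized over the time axis

    # Weighted sum of past-frame values gives one feature map per horizon.
    future = np.einsum("hnt,tnd->hnd", weights, v)
    return future, weights


# Toy usage with random inputs: 3 past frames, 4 tokens, 8 channels, 2 horizons.
rng = np.random.default_rng(0)
T, N, C, H = 3, 4, 8, 2
past_feats = rng.standard_normal((T, N, C))
horizon_queries = rng.standard_normal((H, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))

future, weights = temporal_cross_attention(past_feats, horizon_queries, w_q, w_k, w_v)
```

In this sketch each horizon produces its own feature map, which a downstream 2D-to-3D view transformer and 3D encoder could then consume. A trained model would learn the queries and projections rather than sample them at random.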
