ForecastOcc: Vision-based Semantic Occupancy Forecasting

Riya Mohan, Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

TL;DR

ForecastOcc tackles the problem of predicting future 3D semantic occupancy directly from multi-view images, without relying on externally estimated occupancy maps. It introduces a forecasting module that uses temporal cross-attention to synthesize horizon-aware image representations, which are then lifted into 3D via a view transformer and refined by a 3D occupancy encoder and a semantic head. The approach is validated on Occ3D-nuScenes and SemanticKITTI, with extensive ablations showing the benefits of the Future State Alignment loss, contextual embeddings, and multi-view training. Results show consistent improvements over baselines across horizons and settings, demonstrating strong temporal-semantic reasoning for autonomous driving applications.

Abstract

Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
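To make the idea of a temporal cross-attention forecasting module more concrete, the following is a minimal single-head sketch in plain Python. It is not the authors' implementation: the function names, the absence of learned query/key/value projections, and the use of one feature vector per frame at a single spatial location are all simplifying assumptions made for illustration. It only shows how a query derived from the current frame could weight features from several past frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_cross_attention(query, frame_tokens):
    """Toy single-head cross-attention over time (hypothetical helper).

    query:        feature vector for one location in the current frame.
    frame_tokens: one feature vector per past frame at the same location.
    Returns the attention-weighted feature and the per-frame weights.
    Assumption: no learned projections; keys and values are the raw tokens.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product score between the query and each frame's token.
    scores = [scale * sum(q * k for q, k in zip(query, token))
              for token in frame_tokens]
    weights = softmax(scores)
    # Weighted sum of the tokens, one output value per channel.
    fused = [sum(w * token[i] for w, token in zip(weights, frame_tokens))
             for i in range(len(query))]
    return fused, weights


if __name__ == "__main__":
    current = [1.0, 0.0]
    past = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    fused, weights = temporal_cross_attention(current, past)
    print("weights:", weights)
    print("fused:", fused)
```

In this toy setting, the past frame whose feature is most aligned with the query receives the largest weight. A full model would presumably add learned projections, multiple heads, and the scale, view, and temporal embeddings described in the Figure 2 caption.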

Paper Structure (23 sections, 5 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Different scene forecasting frameworks. Existing approaches include 2D semantic forecasting from single images, which is limited to image-plane predictions. Image-based occupancy forecasting predicts future occupancy but lacks semantic understanding. Occupancy-based semantic forecasting requires occupancy predictions as input. In contrast, our vision-based semantic occupancy forecasting directly leverages multi-view images to jointly forecast both occupancy and semantics, enabling semantically rich 3D future scene understanding.
  • Figure 2: Architecture of ForecastOcc for semantic occupancy forecasting. Multi-view images from past, current, and future frames are used as input, with future views used only during training. Each image is encoded into 2D features $F^{2D}$. In the forecasting module, $F^{2D}$ are embedded as $\text{Scale}(4,256)\times\text{Views}(M,256)\times\text{Temporal}(4,256)$. The future state query from $F^{2D}_t$ has shape $(H/16)\times(W/16)\times M\times256$. The future interaction layers share weights, and each contains two multi-head self-attention blocks, a feedforward network, and a future state synthesizer $(\text{linear}+\text{ReLU})\times2,\ \text{linear}$ with embedding dimension 256 and 16 heads. Depth and context distributions have sizes $M\times88\times H/16\times W/16$ and $M\times64\times H/16\times W/16$, yielding a 3D feature volume of size $64\times16\times200\times200$. A 3D ResNet occupancy encoder and a semantic predictor output voxelwise logits of size $N_C\times16\times200\times200$.
  • Figure 3: Qualitative comparison of predictions from our proposed ForecastOcc model with the second-best baseline $I^{2}$-World on the Occ3D-nuScenes dataset. We show the multi-view camera images corresponding to the future frame at $T + k$ and the semantic occupancy ground truth (GT).
  • Figure 4: Qualitative comparison of predictions from our proposed ForecastOcc model with the second-best baseline PDCast (Hurtado et al., 2024) on the SemanticKITTI dataset. We show the image corresponding to the future frame at $T + k$ and the semantic occupancy ground truth (GT).
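
The tensor sizes quoted in the Figure 2 caption can be checked with a small bookkeeping sketch. The helper below simply evaluates the caption's shape expressions for a given input resolution. The function and field names are hypothetical, and the example resolution (256×704), the number of views (6), and the class count (18) are assumptions chosen for illustration, since the caption leaves $H$, $W$, $M$, and $N_C$ symbolic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fig2Shapes:
    """Tensor shapes named in the Figure 2 caption (hypothetical container)."""
    future_query: tuple   # (H/16, W/16, M, 256)
    depth: tuple          # (M, 88, H/16, W/16)
    context: tuple        # (M, 64, H/16, W/16)
    volume: tuple         # (64, 16, 200, 200)
    logits: tuple         # (N_C, 16, 200, 200)


def figure2_shapes(height, width, num_views, num_classes):
    """Evaluate the caption's shape expressions for concrete sizes.

    The constants (stride 16, 88 depth bins, 64 context channels,
    embedding dimension 256, 16x200x200 voxel grid) come from the caption;
    the arguments are supplied by the caller and are not given in the source.
    """
    stride, depth_bins, channels, embed_dim = 16, 88, 64, 256
    grid = (16, 200, 200)
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by 16")
    h, w = height // stride, width // stride
    return Fig2Shapes(
        future_query=(h, w, num_views, embed_dim),
        depth=(num_views, depth_bins, h, w),
        context=(num_views, channels, h, w),
        volume=(channels, *grid),
        logits=(num_classes, *grid),
    )


if __name__ == "__main__":
    # Assumed example values, not taken from the paper.
    print(figure2_shapes(height=256, width=704, num_views=6, num_classes=18))
```

Note that the 3D volume and logit sizes do not depend on the image resolution or the number of views: the view transformer maps all per-view features into one fixed voxel grid.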