QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

TL;DR

QueryOcc learns 3D semantic occupancy from images without manual 3D labels by supervising with 4D queries sampled across adjacent frames. It combines a contractive BEV representation, lift-contract-splat lifting, and a unified 4D decoder to predict occupancy and semantics directly at arbitrary 4D points $\mathbf{q}=[x,y,z,t]^\top$, enabling long-range reasoning under constant memory. Supervision can come from pseudo-point clouds derived from vision foundation models or from raw lidar data. QueryOcc achieves state-of-the-art performance among self-supervised camera-based methods on Occ3D-nuScenes while running at 11.6 FPS on an A100 GPU. The key contributions are the QueryOcc framework, a contractive BEV representation for unbounded scenes, and a 4D query-based supervision strategy that outperforms rendering-based and voxelized baselines. These results indicate that direct 4D supervision is practical and scalable for large-scale self-supervised 3D scene understanding in autonomous driving.

Abstract

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
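
To make the idea of a contractive scene representation concrete, here is a minimal sketch of a per-axis contraction that leaves near-field coordinates untouched and smoothly squeezes everything farther away into a bounded band. The abstract does not give the paper's actual contraction function, so the formula below (a scaled variant of the contraction used in Mip-NeRF 360), the names `contract_axis` and `contract_bev`, and the 40 m inner range are all assumptions chosen for illustration.

```python
import math


def contract_axis(x: float, r: float = 40.0) -> float:
    """Contract one coordinate (assumed formula, not taken from the paper).

    |x| <= r : returned unchanged, so near-field detail is preserved.
    |x| >  r : mapped smoothly into the band (r, 2r), so an unbounded
               range fits in a fixed-size grid.
    """
    ax = abs(x)
    if ax <= r:
        return x
    # Slope is 1 at |x| = r and decays to 0 as |x| grows.
    return math.copysign(r * (2.0 - r / ax), x)


def contract_bev(x: float, y: float, r: float = 40.0) -> tuple[float, float]:
    """Apply the contraction independently to each BEV axis (axis-aligned)."""
    return contract_axis(x, r), contract_axis(y, r)


if __name__ == "__main__":
    for dist in (10.0, 40.0, 80.0, 400.0):
        print(dist, "->", contract_axis(dist))
```

Because the contracted coordinate never exceeds `2 * r`, a BEV grid of fixed size can cover arbitrarily distant points, which is one way to read the "constant memory" claim above.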

Paper Structure

This paper contains 26 sections, 8 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: QueryOcc learns to produce continuous 3D semantic occupancy from images through direct spatio-temporal query supervision from sequential frames. We outperform prior methods by 26% in semantic RayIoU while maintaining real-time inference at 11.6 FPS.
  • Figure 2: Overview of QueryOcc. Multi-view camera images are encoded and lifted to BEV via our lift-contract-splat module, combining geometric encoding, log-linear depth bins, and an axis-aligned BEV contraction. The BEV features form a spatially grounded representation from which a unified decoder predicts occupancy, semantics, or distilled vision foundation model features for continuous queries.
  • Figure 3: PCA visualization of lifted features $\mathcal{G}$ and BEV features $\mathcal{Z}$. The proposed BEV contraction and point encoding enable efficient modeling of large scenes, and the learned BEV representations exhibit structured separation between occupied, occluded, and free-space areas.
  • Figure 4: Overview of the self-supervision process for a camera-only setup. Adjacent frames provide supervision through pseudo-point clouds from VFM-predicted depth, semantic pseudo-labels, or features. These 3D points generate positive and negative 4D queries used to supervise occupancy, semantics, and feature distillation. The framework can optionally be complemented by lidar point clouds.
  • Figure 5: Effect of temporal window. Supervising across multiple timesteps improves geometric priors. Forward and backward supervision perform better than just forward.
  • ...and 9 more figures
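
The Figure 4 caption says that 3D points generate positive and negative 4D queries. One plausible reading, sketched below, is that the observed point itself becomes a positive (occupied) query, while points sampled along the sensor ray in front of it become negative (free-space) queries. The captions do not spell out the sampling scheme, so the function name `make_queries`, the `margin` parameter, and the uniform sampling are hypothetical and not the authors' procedure.

```python
import math
import random


def make_queries(origin, point, t, n_free=4, margin=0.5, rng=None):
    """Build labeled 4D queries [x, y, z, t] from one observed 3D point.

    Returns a list of ((x, y, z, t), label) pairs, where label 1 means
    occupied and label 0 means free. Assumed scheme for illustration only.
    """
    rng = rng or random.Random(0)
    ox, oy, oz = origin
    px, py, pz = point
    dx, dy, dz = px - ox, py - oy, pz - oz
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Positive query: the observed surface point at timestamp t.
    queries = [((px, py, pz, t), 1)]
    if dist <= margin:
        return queries  # too close to the sensor to sample free space

    # Negative queries: uniform samples on the ray, stopping `margin`
    # metres before the surface so they stay in observed free space.
    s_max = (dist - margin) / dist
    for _ in range(n_free):
        s = rng.uniform(0.0, s_max)
        queries.append(((ox + s * dx, oy + s * dy, oz + s * dz, t), 0))
    return queries


if __name__ == "__main__":
    for q, label in make_queries((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), t=0.5):
        print(label, q)
```

Each query carries its own timestamp, so points from adjacent frames can be mixed into a single batch and supervised independently.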