Stream and Query-guided Feature Aggregation for Efficient and Effective 3D Occupancy Prediction

Seokha Moon, Janghyun Baek, Giseop Kim, Jinkyu Kim, Sunwook Choi

TL;DR

DuOcc tackles the accuracy–efficiency trade-off in 3D occupancy prediction with a dual aggregation framework that preserves dense voxel geometry while remaining computationally efficient. StreamAgg accumulates voxel features over time with motion-aware warping and lightweight refinement, while QueryAgg injects instance-level information about dynamic objects via deformable attention and selective aggregation. The combined approach yields state-of-the-art results on Occ3D-nuScenes and SurroundOcc under real-time constraints, while reducing memory usage by over 40% compared to prior methods. This makes high-fidelity occupancy maps practical for real-time 3D scene understanding in autonomous driving.

Abstract

3D occupancy prediction has become a key perception task in autonomous driving, as it enables comprehensive scene understanding. Recent methods enhance this understanding by incorporating spatiotemporal information through multi-frame fusion, but they suffer from a trade-off: dense voxel-based representations provide high accuracy at significant computational cost, whereas sparse representations improve efficiency but lose spatial detail. To mitigate this trade-off, we introduce DuOcc, which employs a dual aggregation strategy that retains dense voxel representations to preserve spatial fidelity while maintaining high efficiency. DuOcc consists of two key components: (i) Stream-based Voxel Aggregation, which recurrently accumulates voxel features over time and refines them to suppress warping-induced distortions, preserving a clear separation between occupied and free space; and (ii) Query-guided Aggregation, which compensates for the limitations of voxel accumulation by selectively injecting instance-level query features into the voxel regions occupied by dynamic objects. Experiments on the widely used Occ3D-nuScenes and SurroundOcc datasets demonstrate that DuOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by over 40% compared to prior methods.
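The recurrent accumulation described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`warp_voxels`, `stream_step`), the nearest-neighbour sampling, the grid origin at voxel index 0, and the fixed blending weight `alpha` are all assumptions made for illustration. In particular, the plain weighted average stands in for the learned refinement step, which the abstract does not specify in detail.

```python
import numpy as np

def warp_voxels(prev_feat, cur_to_prev, voxel_size=1.0):
    """Warp a (C, X, Y, Z) feature grid from the previous ego frame into the
    current one using nearest-neighbour sampling (an assumed, simplified scheme).

    cur_to_prev: 4x4 homogeneous transform mapping current-frame points to the
    previous frame. Returns the warped grid and a (X, Y, Z) validity mask.
    """
    C, X, Y, Z = prev_feat.shape
    # Centres of the current-frame voxels, in metric units.
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    pts = (idx + 0.5) * voxel_size
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    # Where each current voxel centre lies in the previous frame.
    prev_pts = (pts_h @ cur_to_prev.T)[:, :3]
    src = np.floor(prev_pts / voxel_size).astype(int)
    valid = np.all((src >= 0) & (src < np.array([X, Y, Z])), axis=1)
    # Voxels that fall outside the previous grid stay zero.
    warped = np.zeros((C, X * Y * Z), dtype=prev_feat.dtype)
    s = src[valid]
    warped[:, valid] = prev_feat[:, s[:, 0], s[:, 1], s[:, 2]]
    return warped.reshape(C, X, Y, Z), valid.reshape(X, Y, Z)

def stream_step(prev_state, cur_feat, cur_to_prev, alpha=0.5):
    """One recurrent update: warp the running state, then blend it with the
    current frame. The fixed-weight blend is a placeholder for learned fusion."""
    warped, valid = warp_voxels(prev_state, cur_to_prev)
    blended = alpha * warped + (1.0 - alpha) * cur_feat
    # Where no history is available, fall back to the current features.
    return np.where(valid[None], blended, cur_feat)
```

Because only one running state is kept and updated per frame, the cost per step does not grow with the length of the history, which is the property that distinguishes streaming accumulation from fusing a stack of past frames.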

Paper Structure

This paper contains 25 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Existing multi-frame fusion based methods leverage temporal information by processing multiple frames. (a) Methods using dense representations (e.g., COTR, PanoOcc) preserve spatiotemporal detail but incur high computational cost, and (b) methods based on sparse representations (e.g., SparseOcc, OPUS, GSD-Occ, GaussianFormer, GaussianFormer-2) are more efficient yet inevitably lose spatial information. In contrast, (c) our proposed DuOcc maintains dense voxel representations through stream-based and query-guided aggregation, achieving low cost and high spatial fidelity.
  • Figure 2: Overview of DuOcc, which predicts the 3D occupancy state of each voxel in a streaming manner with a dual aggregation strategy. First, voxel features are recurrently accumulated over time through Stream-based Voxel Feature Aggregation (StreamAgg), which efficiently handles features of stationary objects. These temporally aggregated features are then further refined using Query-guided Aggregation (QueryAgg), which utilizes instance queries that encode fine-grained, instance-level features to enhance the representation of non-stationary (dynamic) objects.
  • Figure 3: Overview of RefineNet. We propose RefineNet to mitigate warping-induced distortions and emphasize object-occupied regions in voxel space. It refines warped features using BottleneckConv3D with 3D channel and spatial attention, selectively enhancing meaningful voxels while suppressing noise. To ensure reliable learning, two auxiliary supervision heads, occupied mask and forecasting, are used only at training time.
  • Figure 4: Examples of 3D occupancy prediction results from our proposed method (DuOcc) and GSD-Occ. The corresponding input image and ground-truth 3D occupancy are also provided. Our method demonstrates notably improved prediction quality for both dynamic objects (1st and 2nd rows) and static objects (3rd row). Refer to the dotted circles for a detailed comparison of prediction outputs. More examples are provided in the supplemental material.
  • Figure 5: Effect of QueryAgg, particularly for dynamic objects that are distant, overlapping, or occluded. The third column visualizes detection outputs from the instance queries used in QueryAgg.
  • ...and 7 more figures
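
The selective injection that the Figure 2 and Figure 5 captions attribute to QueryAgg can be pictured with the sketch below. It is a deliberate simplification under stated assumptions, not the paper's method: the source describes deformable attention, whereas this sketch uses an axis-aligned box mask per instance query. The function name `inject_queries`, the box parameterisation (centre and size in voxel units), and the score threshold `thresh` are hypothetical.

```python
import numpy as np

def inject_queries(voxel_feat, centers, sizes, query_feat, scores, thresh=0.5):
    """Add each confident instance query's feature to the voxels inside its box.

    voxel_feat: (C, X, Y, Z) aggregated voxel features.
    centers, sizes: (N, 3) box centres and extents, in voxel units (assumed).
    query_feat: (N, C) per-instance feature vectors.
    scores: (N,) confidence per query; low-scoring queries are skipped.
    """
    C, X, Y, Z = voxel_feat.shape
    out = voxel_feat.copy()
    # Voxel-centre coordinates, shape (X, Y, Z, 3).
    grid = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    ) + 0.5
    for center, size, feat, score in zip(centers, sizes, query_feat, scores):
        if score < thresh:
            continue  # selective: only confident instances touch the grid
        inside = np.all(np.abs(grid - center) <= size / 2.0, axis=-1)
        out[:, inside] += feat[:, None]
    return out
```

Only voxels covered by a confident instance are modified, so the rest of the dense grid, which mostly holds static structure, passes through unchanged.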