Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

Dubing Chen, Huan Zheng, Jin Fang, Xingping Dong, Xianfei Li, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen

TL;DR

This work addresses the underexplored area of temporal fusion in vision-based 3D semantic occupancy prediction (VisionOcc) by identifying three key temporal cues (scene-level consistency, motion calibration, and geometric complementation) and proposing GDFusion, a gradient-descent-inspired RNN framework that unifies multi-representation fusion. By reinterpreting vanilla RNNs as optimization steps, the approach fuses voxel-level features with scene-adaptive parameters, motion maps, and geometry priors in a memory-efficient, streaming manner. Empirical results on Occ3D, SurroundOcc, and OpenOccupancy show consistent improvements in mIoU (e.g., 1.4%-4.8% on Occ3D) and substantial memory reductions (27%-72%), with negligible inference overhead. The method is plug-and-play across VisionOcc baselines and offers broad potential for dynamic scene understanding in autonomous driving and robotics.

Abstract

We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves 1.4%-4.8% mIoU improvements and reduces memory consumption by 27%-72%.

Paper Structure

This paper contains 26 sections, 1 theorem, 46 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

The RNN update step $h^t = Ah^{t-1} + Bx^t$ is equivalent to a gradient descent step on $h^{t-1}$ minimizing the loss function $\mathcal{L}^t = \|Ah^{t-1} - Bx^t\|^2$.
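
To make the idea behind the proposition concrete, the following minimal sketch checks one special case numerically. It is an illustration under stated assumptions, not the paper's proof: it assumes the simpler loss $\frac{1}{2}\|h^{t-1} - x^t\|^2$ and a hypothetical step size `lr`, so that one gradient step $h^{t-1} - \eta\,(h^{t-1} - x^t) = (1-\eta)\,h^{t-1} + \eta\,x^t$ has the linear RNN form with $A = (1-\eta)I$ and $B = \eta I$. The function names are made up for this example.

```python
# Illustrative special case only (assumed loss and scalar weights, not the paper's setup).

def gradient_step(h_prev, x, lr):
    """One gradient descent step on h_prev for L = 0.5 * ||h_prev - x||^2."""
    grad = [h - xi for h, xi in zip(h_prev, x)]  # dL/dh = h - x
    return [h - lr * g for h, g in zip(h_prev, grad)]

def linear_rnn_step(h_prev, x, a, b):
    """Linear RNN update h_t = a * h_prev + b * x with identity-scaled weights."""
    return [a * h + b * xi for h, xi in zip(h_prev, x)]

lr = 0.25                  # hypothetical step size (eta)
h_prev = [1.0, -2.0, 0.5]  # previous hidden state h^{t-1}
x = [0.0, 4.0, 0.5]        # current-frame input x^t

h_gd = gradient_step(h_prev, x, lr)
h_rnn = linear_rnn_step(h_prev, x, a=1.0 - lr, b=lr)

# Both routes give the same next state in this special case.
print(h_gd, h_rnn)
```

In this reading, the step size controls how strongly the stored state is pulled toward the current frame; other loss choices would yield different effective weights.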

Figures (4)

  • Figure 1: Motivation behind the proposed temporal fusion. (a): VisionOcc pipeline. (b): Proposed temporal cues, showing historical motion and geometric data enhancing current viewpoints, with scene consistency priors from historical information.
  • Figure 2: Multi-level temporal fusion in the VisionOcc pipeline. Volume features $\mathbf{V}^t$, geometry $\mathbf{G}^t$, motion $\mathbf{M}^t$, and scene-adaptive parameters $\mathbf{S}^t$ are enhanced through RNN-style temporal fusion, each capturing distinct temporal dynamics. Single-frame-sized historical states $\mathbf{H}_v^{t-1}$, $\mathbf{H}_g^{t-1}$, $\mathbf{H}_m^{t-1}$, and $\mathbf{H}_s^{t-1}$ are stored in memory and updated frame-by-frame.
  • Figure 3: Update dynamics of gradient descent-based temporal fusion pipeline. $\mathbf{f}^t$ denotes the (geometry, motion, voxel-level, scene-level) feature of the current frame. $\mathbf{H}^{t-1}$ and $\mathbf{H}^{t}$ represent the prior and current historical states, respectively.
  • Figure A.4: Qualitative comparison between BEVDetOcc-SF, FBOcc, and ALOcc, each enhanced by our GDFusion method on Occ3D. The top row in the leftmost column shows the input images, presented in the following order: camera front left, camera front, camera front right, camera back left, camera back, and camera back right. The bottom row in the leftmost column displays the ground-truth semantic occupancy. The middle section illustrates the results of the three baselines, while the rightmost column presents the results after incorporating our method. Key areas are highlighted with red boxes for emphasis.

Theorems & Definitions (2)

  • Proposition 1
  • proof