Deflickering Vision-Based Occupancy Networks through Lightweight Spatio-Temporal Correlation

Fengcheng Yu, Haoran Xu, Canming Xia, Ziyang Zong, Guang Tan

TL;DR

This work tackles flickering in vision-based occupancy networks (VONs) used for autonomous driving by introducing OccLinker, a lightweight plug-in that leverages historical static cues and high-frequency motion information. OccLinker tokenizes static and motion features into sparse representations and applies dual cross-attention to learn compact latent correlations with the current frame, producing a correction term that refines the base VON predictions without retraining the backbone. The authors demonstrate that OccLinker improves both spatial occupancy accuracy (IoU/mIoU) and temporal consistency across two benchmarks (SurroundOcc and Occ3D) with minimal overhead, achieving favorable accuracy-efficiency trade-offs. The method is modular and compatible with existing VONs, offering a practical path to deflicker occupancy predictions in real-time autonomous systems.

Abstract

Vision-based occupancy networks (VONs) provide an end-to-end solution for reconstructing 3D environments in autonomous driving. However, existing methods often suffer from temporal inconsistencies, manifesting as flickering effects that degrade temporal coherence and adversely affect downstream decision-making. While recent approaches incorporate historical information to alleviate this issue, they often incur high computational costs and may introduce misaligned or redundant features that interfere with object detection. We propose OccLinker, a novel plugin framework that can be easily integrated into existing VONs to improve performance. Our method efficiently consolidates historical static and motion cues, learns sparse latent correlations with current features through a dual cross-attention mechanism, and generates correction occupancy components to refine the base network predictions. In addition, we introduce a new temporal consistency metric to quantitatively measure flickering effects. Extensive experiments on two benchmark datasets demonstrate that our method achieves superior performance with minimal computational overhead while effectively reducing flickering artifacts.
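
To make the notion of flickering concrete, the following minimal sketch computes a simple per-frame label change rate over a sequence of predicted occupancy grids. This is an illustrative stand-in and not the temporal consistency metric introduced in the paper, whose definition is not given in this summary. The function name `label_change_rate` and the toy grids are assumptions, and the measure ignores ego-motion, so it is only meaningful for grids that are already aligned to a common frame.

```python
import numpy as np

def label_change_rate(occupancy_seq):
    """Fraction of voxels whose label differs between consecutive frames.

    Hypothetical helper (not the paper's metric). Expects an array of shape
    (T, X, Y, Z) holding integer class labels, already aligned across frames.
    Returns an array of length T - 1.
    """
    seq = np.asarray(occupancy_seq)
    changed = seq[1:] != seq[:-1]                      # (T-1, X, Y, Z) booleans
    return changed.reshape(changed.shape[0], -1).mean(axis=1)

# Toy example: one occupied voxel (label 1) vanishes in the middle frame and
# reappears in the next one, which is the kind of flicker described above.
frame_a = np.array([[[0], [1]], [[0], [0]]])           # shape (2, 2, 1)
frame_b = np.zeros_like(frame_a)                       # the voxel is missed
rates = label_change_rate([frame_a, frame_b, frame_a])
print(rates)
```

A temporally stable prediction of a static, aligned scene would give rates close to zero; a voxel that repeatedly drops out and returns raises the rate on both transitions.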

Paper Structure

This paper contains 36 sections, 15 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Enhancing existing VONs with OccLinker. (i) In the upper part, the first row shows three consecutive images where the ego-vehicle moves forward along the right side of the road. The blue box highlights a pedestrian as a sparse visual cue, and the second row provides the corresponding zoomed-in views. (ii) In the lower part, we compare different types of VON methods. The first row shows a standard 3D VON (SurroundOcc) that ignores historical information, resulting in the pedestrian being missed in certain frames. The second row presents a history-aware VON (BEVDet4D) that utilizes past frames but still produces incomplete voxels due to suboptimal temporal association. In contrast, our module, when integrated with the 3D VON SurroundOcc, achieves smoother and more complete occupancy predictions through effective spatio-temporal correlation.
  • Figure 2: Comparison with history-aware VON methods in terms of mIoU and running costs. Points closer to the bottom-left indicate better efficiency, with lower memory usage and faster inference. Larger circles denote higher prediction quality with mIoU shown in numbers. Our method, when integrated with ViewFormer, achieves the best mIoU and the lowest memory consumption, while maintaining competitive inference latency.
  • Figure 3: Overview of OccLinker. A frozen base 3D VON (SurroundOcc, MonoScene, or ViewFormer) runs at a fixed frequency to extract static texture features with $\Phi_{\rm SE}^*$ and to generate an initial occupancy prediction with $\Phi_{\rm SD}^*$. We employ $\Phi_{\rm ME}$ to extract motion features from intermediate frame differences. OccLinker uses the static and motion features from the recent temporal window as queries, and the current static feature as the key and value. Through a lightweight tokenization encoder $\phi$ and a dual cross-attention module $\Theta$, it constructs spatio-temporal correlations and outputs a correction term $\Delta O_t$ to refine the initial prediction.
  • Figure 4: Correlation pipeline in OccLinker. Using $L=1$ as an example, OccLinker explicitly separates static and dynamic correlations by treating current static features as keys and values, while using historical static features and intermediate motion features as queries.
  • Figure 5: Comparison in a T-junction scenario with partial and dynamic pedestrian occlusions. SurroundOcc+OccLinker produces robust predictions in which the pedestrian is tracked consistently, while SOTA methods exhibit flickering.
  • ...and 2 more figures
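
The query/key/value arrangement described in the captions of Figures 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the token counts, the feature width `d`, the mean-pooling fusion, the linear head `w_head`, and the additive refinement are all hypothetical choices made so the example runs. Only the roles are taken from the captions: historical static tokens and motion tokens act as queries, and the current static tokens act as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns the output and the weights."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)   # (Nq, Nk)
    return weights @ values, weights

def occupancy_correction(hist_static, motion, cur_static, w_head, grid_shape):
    """Hypothetical stand-in for the dual cross-attention module.

    Two separate attention passes share the same keys/values (current static
    tokens) but use different queries (historical static vs. motion tokens).
    """
    static_corr, _ = cross_attention(hist_static, cur_static, cur_static)
    motion_corr, _ = cross_attention(motion, cur_static, cur_static)
    # Assumed fusion: mean-pool each branch, concatenate, apply a linear head.
    fused = np.concatenate([static_corr.mean(axis=0), motion_corr.mean(axis=0)])
    return (fused @ w_head).reshape(grid_shape)

rng = np.random.default_rng(0)
d, grid_shape = 8, (4, 4, 2, 3)               # toy grid X, Y, Z and 3 classes
hist_static = rng.normal(size=(5, d))         # tokens from past static features
motion = rng.normal(size=(6, d))              # tokens from frame differences
cur_static = rng.normal(size=(10, d))         # tokens from the current frame
w_head = rng.normal(scale=0.1, size=(2 * d, int(np.prod(grid_shape))))

base_logits = rng.normal(size=grid_shape)     # stands in for the frozen VON output
delta = occupancy_correction(hist_static, motion, cur_static, w_head, grid_shape)
refined_logits = base_logits + delta          # assumed additive refinement
print(refined_logits.shape)
```

Because the base network is frozen, only the small correction path would be trained in a setup like this, which is one plausible reading of why the module can be attached to different VONs at low cost.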