ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Jinke Li; Xiao He; Chonghua Zhou; Xiaoqiang Cheng; Yang Wen; Dan Zhang

TL;DR

ViewFormer addresses the limitations of projection-first multi-view fusion for 3D occupancy by introducing a learning-first view attention and a streaming temporal attention mechanism. Together these enable robust spatiotemporal aggregation of multi-view features to predict 3D occupancy and occupancy flow, while FlowOcc3D provides a fine-grained motion benchmark. On Occ3D and OpenOcc, ViewFormer achieves state-of-the-art performance with rapid convergence, and FlowOcc3D demonstrates the value of occupancy-level flow for dynamic scenes. The approach offers a scalable, vision-centric framework for accurate, temporally consistent 3D perception in autonomous driving, with practical implications for map construction and motion-aware perception.

Abstract

3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantizing the physical space into a grid map. The widely adopted projection-first deformable attention, while efficient at transforming image features into 3D representations, struggles to aggregate multi-view features because of sensor deployment constraints. To address this issue, we propose a learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention together with an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential of occupancy flow to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The code is available at https://github.com/ViewFormerOcc/ViewFormer-Occ.
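To make the grid-map representation in the first sentence concrete, the following is a minimal sketch of how labeled 3D points could be quantized into voxel indices. It is not taken from the ViewFormer code: the function names, the grid extent, the 0.5 m voxel size, and the label values are all illustrative assumptions.

```python
import math

def point_to_voxel(point, range_min, voxel_size, grid_shape):
    """Map a 3D point (metres) to integer voxel indices, or None if it lies outside the grid."""
    idx = []
    for p, lo, n in zip(point, range_min, grid_shape):
        i = math.floor((p - lo) / voxel_size)
        if i < 0 or i >= n:
            return None  # point falls outside the perception range
        idx.append(i)
    return tuple(idx)

def occupancy_from_points(labeled_points, range_min, voxel_size, grid_shape):
    """Build a sparse occupancy map {voxel index: semantic label}; unlisted voxels are free."""
    occ = {}
    for point, label in labeled_points:
        idx = point_to_voxel(point, range_min, voxel_size, grid_shape)
        if idx is not None:
            occ[idx] = label  # if several points share a voxel, the last one wins
    return occ

# Hypothetical grid: x, y in [-40, 40) m, z in [-1, 5) m, 0.5 m voxels.
RANGE_MIN = (-40.0, -40.0, -1.0)
VOXEL_SIZE = 0.5
GRID_SHAPE = (160, 160, 12)

points = [((0.2, -0.3, 0.1), 3), ((100.0, 0.0, 0.0), 5)]  # (xyz, label) pairs
occupancy = occupancy_from_points(points, RANGE_MIN, VOXEL_SIZE, GRID_SHAPE)
```

Because every voxel carries a label regardless of whether it belongs to a tracked object, the representation makes no foreground/background distinction; the second point above is simply dropped for lying outside the grid.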

Paper Structure (22 sections, 8 equations, 9 figures, 9 tables)

Introduction
Related Work
Methodology
View Attention
Streaming Temporal Attention
Optimization
Occupancy Flow Generation
Loss Function
Implementation Details
Evaluation
Datasets
Main Results
Ablations and Analysis
Streaming Temporal Attention
Qualitative Results
...and 7 more sections

Figures (9)

Figure 1: Treating objects simply as 3D boxes lacks the sense of the background, such as the suitcase in (a). Defining 3D space as occupancies (b) and (c) is more effective in representing objects. Beyond static occupancy, occupancy flow is crucial to perceive dynamic scenes. In the case of a turning car in (d), different flow directions of occupancies can be clearly observed.
Figure 2: Constrained by fixed reference points, the projection-first method (a) introduced in BEVFormer (Li et al., ECCV 2022) fails to collect multi-view features. In contrast, our learning-first view attention (b) gathers features from multiple cameras more adequately.
Figure 3: ViewFormer pipeline. A backbone first extracts the multi-view features $F_t$ from the input images. View attention, designed to address the limitations of the existing projection-first method, then aggregates multi-view features for the voxels $V^{\prime}_{t}$ more adequately. In the streaming temporal attention, the voxel queries $V^{\prime}_{t}$ are squeezed into BEV queries $B_t$ to reduce computational complexity. Each BEV cell of $B_t$ interacts with the historical multi-frame BEV features stored in the streaming memory queue, with an ego transformation compensating for ego motion. The voxels $V_{t}$, obtained by unsqueezing the updated BEV features, are then fed into 3D occupancy and occupancy flow prediction. The updated BEV queries are pushed into the memory queue for subsequent temporal interaction in the video stream pipeline.
Figure 4: Occupancy flow vs. object flow. Object flow assigns only a single flow vector to the entire object as in (a) and (c), while occupancy flow provides finer-grained flow vectors for all occupancy grids as in (b) and (d), where the color and brightness represent the flow direction and magnitude respectively.
Figure 5: Visualization of view attention (ViewAttn).
...and 4 more figures
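
The streaming memory queue and ego-motion compensation described in the Figure 3 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the ego pose is simplified to a planar `(x, y, yaw)` triple in a fixed global frame, and the attention itself is omitted.

```python
import math
from collections import deque

class StreamingBevMemory:
    """Hypothetical FIFO queue of past BEV features, each stored with its ego pose."""

    def __init__(self, max_frames=4):
        # deque(maxlen=...) discards the oldest frame once the queue is full
        self.frames = deque(maxlen=max_frames)

    def push(self, bev_features, ego_pose):
        """ego_pose = (x, y, yaw): ego position (m) and heading (rad) in a global frame."""
        self.frames.append((bev_features, ego_pose))

    def __len__(self):
        return len(self.frames)

def ego_to_global(point, pose):
    """Rotate by the ego heading, then translate by the ego position."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    px, py = point
    return (x + c * px - s * py, y + s * px + c * py)

def global_to_ego(point, pose):
    """Inverse of ego_to_global: subtract the position, then rotate by -yaw."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = point[0] - x, point[1] - y
    return (c * dx + s * dy, -s * dx + c * dy)

def compensate(cell_xy, current_pose, past_pose):
    """Locate a current-frame BEV cell centre in the coordinate frame of a past ego pose."""
    return global_to_ego(ego_to_global(cell_xy, current_pose), past_pose)

# The ego vehicle drove 5 m straight ahead between the past and current frames.
memory = StreamingBevMemory(max_frames=2)
memory.push("bev_t-2", (-5.0, 0.0, 0.0))
memory.push("bev_t-1", (0.0, 0.0, 0.0))
memory.push("bev_t", (5.0, 0.0, 0.0))  # evicts "bev_t-2"

# A cell 10 m ahead of the ego now was 15 m ahead of it one frame earlier.
past_xy = compensate((10.0, 0.0), (5.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

In a full model, the compensated coordinates would tell each current BEV query where to sample the stored features of each past frame; the fixed queue length keeps memory bounded as the video stream advances.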
