Memory-based Adapters for Online 3D Scene Perception

Xiuwei Xu; Chong Xia; Ziwei Wang; Linqing Zhao; Yueqi Duan; Jie Zhou; Jiwen Lu

TL;DR

This work tackles online 3D scene perception from streaming RGB-D data, where traditional offline models struggle due to lack of temporal context. It introduces memory-based adapters that cache and aggregate features across time: a 3D voxel memory $m^P_t$ for point clouds and an image memory $m^I_t$, augmented by a 3D-to-2D adapter to bring global 3D context into image features. The approach is plug-and-play, requiring only finetuning on RGB-D videos, and demonstrates leading performance on ScanNet and SceneNN for semantic segmentation, object detection, and instance segmentation. By enabling existing offline architectures to operate online without task-specific designs or extra losses, this method offers a practical route for real-time robotic perception.

Abstract

In this paper, we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline, i.e., they take already reconstructed 3D scene geometry as input. This is not applicable to robotic applications, where the input is a streaming RGB-D video rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To handle online 3D scene perception tasks, where data collection and perception must be performed simultaneously, the model should process 3D scenes frame by frame and make use of temporal information. To this end, we propose an adapter-based plug-and-play module for the backbone of a 3D scene perception model, which constructs memory to cache and aggregate the extracted RGB-D features and thereby empowers offline models with temporal learning ability. Specifically, we propose a queued memory mechanism to cache the supporting point cloud and image features. We then devise aggregation modules that operate directly on the memory and pass temporal information to the current frame. We further propose a 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures for different tasks and significantly boost their performance on online tasks. Extensive experiments on the ScanNet and SceneNN datasets demonstrate that our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods by simply finetuning existing offline models, without any model- or task-specific designs. Project page: https://xuxw98.github.io/Online3D/
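The queued memory idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `QueuedVoxelMemory`, the parameters `max_frames` and `voxel_size`, and the use of plain averaging as the aggregation step are all assumptions made for illustration. The abstract only states that features are cached in a queue and aggregated for the current frame.

```python
import math
from collections import deque


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class QueuedVoxelMemory:
    """Hypothetical fixed-length queue of per-frame voxel features."""

    def __init__(self, max_frames=3, voxel_size=0.5):
        self.voxel_size = voxel_size
        # deque(maxlen=...) silently drops the oldest frame when full.
        self.frames = deque(maxlen=max_frames)

    def _voxelize(self, points, feats):
        """Average the features of all points that fall into the same voxel."""
        buckets = {}
        for point, feat in zip(points, feats):
            key = tuple(math.floor(c / self.voxel_size) for c in point)
            buckets.setdefault(key, []).append(feat)
        return {key: _mean(group) for key, group in buckets.items()}

    def update(self, points, feats):
        """Cache the current frame, then aggregate over all cached frames.

        Assumption: aggregation is a plain mean over every cached frame
        that observed the same voxel; a learned module would replace this.
        """
        current = self._voxelize(points, feats)
        self.frames.append(current)
        aggregated = {}
        for key in current:
            hits = [frame[key] for frame in self.frames if key in frame]
            aggregated[key] = _mean(hits)
        return aggregated


memory = QueuedVoxelMemory(max_frames=2, voxel_size=1.0)
first = memory.update([(0.2, 0.2, 0.2)], [[2.0]])
second = memory.update([(0.7, 0.1, 0.3), (1.5, 0.0, 0.0)], [[4.0], [10.0]])
```

In this sketch, a voxel seen in both frames receives the mean of its two cached features, while a voxel seen only in the current frame keeps its own feature. Once the queue is full, the oldest frame stops contributing.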

Paper Structure (15 sections, 7 equations, 6 figures, 11 tables)

Introduction
Related Work
Approach
Online 3D Scene Perception
Temporal Modeling for Point Clouds
Temporal Modeling for Images
Inter-modal Temporal Modeling
Experiment
Benchmarks and Implementation Details
Comparison with State-of-the-art
Ablation Study
Conclusion
Detailed Architecture
Training Hyperparameters
Class-specific Results

Figures (6)

Figure 1: We propose a general framework for online 3D scene perception. With the presented memory-based adapters, we empower existing offline models in different tasks with online perception ability, which is valuable for robotics applications.
Figure 2: Overall architecture of our approach. We insert memory-based adapters after the image and point cloud backbones; these cache the extracted features in memory over time and perform temporal aggregation. A 3D-to-2D adapter is proposed to further exploit inter-modal temporal information. Solid lines indicate operations within a single frame, while dashed lines indicate temporal operations.
Figure 3: The architecture of the memory-based adapter for point cloud features. We cache and aggregate the features in a queue of 3D voxel grids. Gray, green, yellow and red blocks refer to previous, current, updated and aggregated voxel features, respectively.
Figure 4: The architecture of the memory-based adapter for image features. We reorganize the input features and shift out a proportion of channels into the memory, while shifting in previous memory and aggregating temporal information by 2D convolution. We also resort to the 3D memory for more global context.
Figure 5: Visualization results on the online benchmark. Our predictions are accurate and robust to the number of frames. Note that some ground-truth masks are incomplete due to noisy 2D annotations; in these cases our predictions are more reasonable than the ground truth.
...and 1 more figure
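The channel shifting described in the Figure 4 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's module: the class name `ChannelShiftMemory`, the `shift_ratio` parameter, and the zero-filled memory for the first frame are hypothetical. The 2D convolution that aggregates temporal information and the use of the 3D memory are left out.

```python
class ChannelShiftMemory:
    """Hypothetical channel-shift memory for per-frame image features.

    A feature map is a list of C channels, each a 2D list (H x W).
    """

    def __init__(self, num_channels, shift_ratio=0.25):
        # Number of leading channels exchanged with the memory each frame.
        self.num_shift = max(1, int(num_channels * shift_ratio))
        self.memory = None  # channels shifted out of the previous frame

    def step(self, feature):
        """Shift out leading channels to memory and shift in the old memory."""
        shifted_out = feature[:self.num_shift]
        if self.memory is None:
            # Assumption: the first frame has no history, so use zeros.
            shifted_in = [[[0.0 for _ in row] for row in channel]
                          for channel in shifted_out]
        else:
            shifted_in = self.memory
        self.memory = shifted_out
        # Channels from the previous frame followed by the untouched
        # current channels; a 2D convolution would normally mix them.
        return shifted_in + feature[self.num_shift:]


shift = ChannelShiftMemory(num_channels=4, shift_ratio=0.5)
frame_1 = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]  # four 1x1 channels
frame_2 = [[[5.0]], [[6.0]], [[7.0]], [[8.0]]]
out_1 = shift.step(frame_1)
out_2 = shift.step(frame_2)
```

After the second step, the first two channels of the output come from the previous frame and the last two from the current frame, so a subsequent convolution over the channel dimension would see features from both time steps.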