Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation

Xuezhi Xiang, Xi Wang, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen

TL;DR

The paper tackles 3D scene flow estimation from consecutive point clouds by addressing the trade-off between preserving fine geometric detail and modeling long-range dependencies. It introduces a point-voxel fusion network that combines a point branch (PointNet++-based SetConv) with a voxel branch (sparse grid attention with a shifted-window strategy) and augments both with Umbrella Surface Feature Extraction (USFE) to encode local surface geometry. The approach uses SCOOP-inspired matching with optimal transport to establish correspondences, followed by a refinement stage. It achieves state-of-the-art results among self-supervised methods on KITTI and FlyingThings3D, with EPE reductions of 8.51% on KITTI_o and 10.52% on KITTI_s, and is competitive with fully supervised methods. The work suggests that explicit surface geometry modeling combined with sparse attention-driven point-voxel fusion can substantially improve 3D motion estimation in cluttered real-world scenes and narrow the gap to fully supervised approaches.
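To make the "matching with optimal transport" step more concrete, the sketch below runs a generic entropic optimal transport (Sinkhorn) iteration on a small cost matrix. It is an illustration only and is not claimed to be the formulation used in the paper or in SCOOP: the function name `sinkhorn`, the uniform marginals, the regularization `eps`, and the toy cost matrix are all assumptions introduced here.

```python
import math

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic optimal transport between two uniformly weighted point sets.

    cost[i][j] is a hypothetical matching cost between source point i and
    target point j (for example, a feature distance). Returns a transport
    plan whose entries act as soft correspondence weights.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n            # uniform mass on source points (assumption)
    b = [1.0 / m] * m            # uniform mass on target points (assumption)
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: source point 0 is cheap to match with target 0, and 1 with 1.
toy_cost = [[0.0, 1.0],
            [1.0, 0.0]]
plan = sinkhorn(toy_cost)
```

In this toy case the plan concentrates its mass on the diagonal, so each source point is softly matched to its low-cost target. A soft correspondence of this kind can then be turned into a flow estimate by taking a weighted average of target positions, which is the general idea behind correspondence-based scene flow methods.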

Abstract

Scene flow estimation aims to generate the 3D motion field of points between two consecutive frames of point clouds, which has wide applications in various fields. Existing point-based methods ignore the irregularity of point clouds and have difficulty capturing long-range dependencies due to the inefficiency of point-level computation. Voxel-based methods suffer from the loss of detail information. In this paper, we propose a point-voxel fusion method in which a voxel branch, based on sparse grid attention and the shifted-window strategy, captures long-range dependencies, while a point branch captures fine-grained features to compensate for the information lost in the voxel branch. In addition, since xyz coordinates alone struggle to describe the geometric structure of complex 3D objects in the scene, we explicitly encode the local surface information of the point cloud through the umbrella surface feature extraction (USFE) module. We verify the effectiveness of our method through experiments on the FlyingThings3D and KITTI datasets. Our method outperforms all other self-supervised methods and achieves highly competitive results compared to fully supervised methods. We achieve improvements in all metrics, especially EPE, which is reduced by 8.51% on the KITTI_o dataset and by 10.52% on the KITTI_s dataset.
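For readers unfamiliar with the headline metric, EPE conventionally denotes the end-point error: the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below computes it and a relative reduction in percent. The helper names and all numbers are illustrative assumptions, not values from the paper, and the abstract does not state whether its percentages are relative reductions against a prior best method, so `relative_reduction` reflects one plausible reading.

```python
import math

def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for p, g in zip(pred_flow, gt_flow):
        total += math.dist(p, g)  # per-point L2 error
    return total / len(pred_flow)

def relative_reduction(baseline_epe, new_epe):
    """Relative EPE reduction in percent (assumed interpretation)."""
    return 100.0 * (baseline_epe - new_epe) / baseline_epe

# Made-up flows for two points: the first is predicted exactly,
# the second is off by 2 units along y.
pred = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
gt = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
epe = end_point_error(pred, gt)            # (0 + 2) / 2 = 1.0

# Made-up EPE values: going from 0.100 to 0.090 is roughly a 10% reduction.
reduction = relative_reduction(0.100, 0.090)
```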

Paper Structure

This paper contains 9 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our point-voxel fusion scene flow estimation network. The source frame and the target frame are each fed into the point-voxel fusion module, with shared weights, to extract deep features of the two point clouds.
  • Figure 2: Sparse Grid Attention. The values and coordinates of the non-empty voxels are stored in a 3D hash table, and then the coordinates are converted into index values as the Key in the attention calculation process.
  • Figure 3: Visual comparison on the KITTI dataset.
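
The Figure 2 caption describes storing only non-empty voxels in a hash table and converting their coordinates into index values. A minimal sketch of that idea follows, using a Python dict as the hash table and a row-major flattening as the coordinate-to-index conversion. The paper's actual hashing and indexing scheme is not given in this summary, so the function names, the voxel size, and the flattening rule are assumptions.

```python
import math

def voxelize(points, voxel_size):
    """Hash table from integer voxel coordinate to the indices of points inside it.

    Only non-empty voxels get an entry, so memory scales with occupancy
    rather than with the full grid volume.
    """
    table = {}
    for i, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        table.setdefault(key, []).append(i)
    return table

def coord_to_index(coord, grid_min, grid_dims):
    """Flatten a 3D voxel coordinate into one integer (row-major; an assumption)."""
    cx, cy, cz = (c - m for c, m in zip(coord, grid_min))
    _, dy, dz = grid_dims
    return (cx * dy + cy) * dz + cz

def build_index(table):
    """Map flattened index -> voxel coordinate for every non-empty voxel."""
    coords = list(table)
    grid_min = tuple(min(c[a] for c in coords) for a in range(3))
    grid_dims = tuple(max(c[a] for c in coords) - grid_min[a] + 1 for a in range(3))
    return {coord_to_index(c, grid_min, grid_dims): c for c in coords}

# Four made-up points; the first two fall into the same 0.5-unit voxel.
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.5, 0.2, 0.0), (-0.4, 0.0, 0.9)]
table = voxelize(points, voxel_size=0.5)
index = build_index(table)
```

Here `table` has three entries for four points, and `index` gives each occupied voxel a single integer that could serve as a lookup key when attention is restricted to non-empty voxels.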