Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation
Xuezhi Xiang, Xi Wang, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
TL;DR
The paper tackles 3D scene flow estimation from consecutive point clouds by addressing the trade-off between preserving fine geometric detail and modeling long-range dependencies. It introduces a point-voxel fusion network that combines a detail-preserving point branch (PointNet++-based SetConv) with a voxel branch (sparse grid attention with shifted windows) and augments both with Umbrella Surface Feature Extraction (USFE) to encode local surface geometry. The approach uses SCOOP-inspired matching with optimal transport to establish correspondences, followed by a refinement stage. It achieves state-of-the-art results among self-supervised methods on KITTI and FlyingThings3D, with EPE reductions of $8.51\%$ on KITTI_o and $10.52\%$ on KITTI_s, and performs competitively against fully supervised methods. The results indicate that explicit surface geometry modeling combined with cross-scale, sparse-attention-driven fusion can significantly improve 3D motion estimation in cluttered real-world scenes and narrow the gap to fully supervised approaches in practical settings.
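To make the optimal-transport matching step more concrete, the following is a minimal NumPy sketch of entropy-regularized optimal transport solved with Sinkhorn iterations, one common way to obtain soft correspondences between two point clouds. It is an illustration under stated assumptions, not the paper's implementation: the function name `sinkhorn_soft_flow`, the cosine-distance cost, the uniform marginals, and the values of `eps` and `n_iters` are all hypothetical choices made for this example.

```python
import numpy as np

def sinkhorn_soft_flow(src, tgt, feat_src, feat_tgt, eps=0.05, n_iters=50):
    """Estimate a flow for each source point from soft correspondences.

    src, tgt           : (N, 3) and (M, 3) point coordinates.
    feat_src, feat_tgt : (N, C) and (M, C) per-point features.
    eps, n_iters       : entropic regularization and iteration count
                         (illustrative values, not taken from the paper).
    """
    # Cosine-distance cost between features (an assumed cost function).
    fs = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    cost = 1.0 - fs @ ft.T                      # (N, M)

    # Gibbs kernel and uniform marginals (assumed: every point has equal mass).
    kernel = np.exp(-cost / eps)
    a = np.full(src.shape[0], 1.0 / src.shape[0])
    b = np.full(tgt.shape[0], 1.0 / tgt.shape[0])

    # Sinkhorn iterations: alternately rescale rows and columns.
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    transport = u[:, None] * kernel * v[None, :]  # (N, M) transport plan

    # Row-normalize the plan into matching weights, then take the weighted
    # average of target points as each source point's soft correspondence.
    weights = transport / transport.sum(axis=1, keepdims=True)
    soft_target = weights @ tgt
    return soft_target - src, transport
```

In this sketch the flow of a source point is the offset to the weighted average of target points, so sharper transport plans (smaller `eps`) behave more like hard nearest-feature matching, while larger `eps` yields smoother but blurrier correspondences.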
Abstract
Scene flow estimation aims to recover the 3D motion field of points between two consecutive point cloud frames and has wide applications in various fields. Existing point-based methods ignore the irregularity of point clouds and have difficulty capturing long-range dependencies because point-level computation is inefficient, while voxel-based methods lose detail information. In this paper, we propose a point-voxel fusion method in which a voxel branch, based on sparse grid attention and the shifted window strategy, captures long-range dependencies, and a point branch captures fine-grained features to compensate for the information lost in the voxel branch. In addition, since xyz coordinates alone struggle to describe the geometric structure of complex 3D objects in the scene, we explicitly encode the local surface information of the point cloud through the umbrella surface feature extraction (USFE) module. We verify the effectiveness of our method through experiments on the FlyingThings3D and KITTI datasets. Our method outperforms all other self-supervised methods and achieves highly competitive results compared to fully supervised methods. We achieve improvements in all metrics, especially EPE, which is reduced by 8.51% on the KITTI_o dataset and 10.52% on the KITTI_s dataset.
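For readers unfamiliar with the headline metric, EPE (end-point error) is the mean Euclidean distance between predicted and ground-truth flow vectors, and the percentages above are relative reductions of that value. The short sketch below shows both computations; the function names `end_point_error` and `relative_reduction` are hypothetical, and the numbers in the usage lines are toy values for illustration only, not results from the paper.

```python
import numpy as np

def end_point_error(pred_flow, gt_flow):
    """Mean L2 distance between predicted and ground-truth flow, both (N, 3)."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

def relative_reduction(baseline_epe, new_epe):
    """Fractional decrease of EPE relative to a baseline (0.10 means 10%)."""
    return (baseline_epe - new_epe) / baseline_epe

# Toy usage with made-up flows (not data from the paper).
pred = np.zeros((2, 3))
gt = np.array([[3.0, 4.0, 0.0],
               [0.0, 0.0, 0.0]])
epe = end_point_error(pred, gt)  # per-point errors are 5 and 0, so the mean is 2.5
```

Because the reduction is relative, the same percentage corresponds to a smaller absolute change when the baseline EPE is already low.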
