Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

Xiaoyang Wu, Xiang Xu, Lingdong Kong, Liang Pan, Ziwei Liu, Tong He, Wanli Ouyang, Hengshuang Zhao

TL;DR

The paper tackles semantic segmentation on dense 3D LiDAR data from the Waymo Open Dataset and introduces Point Transformer V3 Extreme, an enhanced PTv3 variant. It combines multi-frame training, a no-clipping-point policy, space-filling curve-based tokenization, and a simple model ensemble, raising validation mIoU from 72.1% to 74.8% and test mIoU from 70.7% to 72.8%. Key ingredients include structured tokenization of unstructured point clouds, efficient attention mechanisms, and a practical ensemble strategy. The approach takes first place on the semantic segmentation leaderboard and shows that transformer-based 3D perception is effective for autonomous driving.

Abstract

In this technical report, we detail our first-place solution for the 2024 Waymo Open Dataset Challenge's semantic segmentation track. We significantly enhanced the performance of Point Transformer V3 on the Waymo benchmark by implementing cutting-edge, plug-and-play training and inference technologies. Notably, our advanced version, Point Transformer V3 Extreme, leverages multi-frame training and a no-clipping-point policy, achieving substantial gains over the original PTv3 performance. Additionally, employing a straightforward model ensemble strategy further boosted our results. This approach secured us the top position on the Waymo Open Dataset semantic segmentation leaderboard, markedly outperforming other entries.
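
The abstract mentions a straightforward model ensemble but does not say how predictions are combined. As an illustration only, the sketch below assumes the common choice of averaging per-point class probabilities across models and taking the argmax; the function names `softmax` and `ensemble_predict` are hypothetical and are not taken from the report.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensemble_predict(per_model_logits: list[np.ndarray]) -> np.ndarray:
    """Average class probabilities over models, then pick the best class.

    ASSUMPTION: probability averaging is one common ensembling rule; the
    report does not confirm that this is the rule the authors used.
    Each array has shape (num_points, num_classes).
    """
    mean_probs = np.mean([softmax(logits) for logits in per_model_logits], axis=0)
    return mean_probs.argmax(axis=-1)


# Two hypothetical models scoring two points over two classes.
model_a = np.array([[2.0, 0.0], [0.0, 1.0]])
model_b = np.array([[0.0, 1.0], [0.0, 3.0]])
labels = ensemble_predict([model_a, model_b])
```

Averaging probabilities rather than raw logits keeps models with differently scaled outputs from dominating the vote, which is one reason this rule is a frequent default.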

Paper Structure

This paper contains 5 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of Point Transformer V3 (PTv3). Compared to its predecessor, PTv2 (Wu et al., 2022), PTv3 shows superiority in the following aspects: 1. Stronger performance. PTv3 achieves state-of-the-art results across a variety of indoor and outdoor 3D perception tasks. 2. Wider receptive field. Benefiting from its simplicity and efficiency, PTv3 expands the receptive field from 16 to 1024 points. 3. Faster speed. PTv3 significantly increases processing speed, making it suitable for latency-sensitive applications. 4. Lower memory consumption. PTv3 reduces memory usage, making it accessible in a broader range of settings.
  • Figure 2: Patch grouping. (a) Reordering the point cloud according to the order derived from a specific serialization pattern. (b) Padding the point cloud sequence by borrowing points from neighboring patches so that its length is divisible by the designated patch size.
  • Figure 4: Overall architecture.
  • Figure 5: Visualization of multi-frame concatenation.
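
The patch grouping described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a Z-order (Morton) curve as the serialization pattern, and the helper names `z_order_codes` and `group_into_patches`, the `grid_size` parameter, and the exact borrowing rule (taking the tail of the preceding patch) are all hypothetical choices made for the example.

```python
import numpy as np


def z_order_codes(grid_coords: np.ndarray, bits: int = 16) -> np.ndarray:
    """Interleave the bits of non-negative integer (x, y, z) coordinates."""
    coords = grid_coords.astype(np.int64)
    codes = np.zeros(len(coords), dtype=np.int64)
    for bit in range(bits):
        for axis in range(3):
            codes |= ((coords[:, axis] >> bit) & 1) << (3 * bit + axis)
    return codes


def group_into_patches(points: np.ndarray, grid_size: float, patch_size: int) -> np.ndarray:
    """Return point indices arranged as (num_patches, patch_size).

    (a) Points are reordered along a space-filling curve (Z-order here,
        an ASSUMPTION; other serialization patterns are possible).
    (b) If the sequence length is not divisible by patch_size, the last
        patch is filled with indices borrowed from the preceding patch.
    """
    grid = np.floor((points - points.min(axis=0)) / grid_size).astype(np.int64)
    order = np.argsort(z_order_codes(grid), kind="stable")

    n = len(order)
    remainder = n % patch_size
    pad = (-n) % patch_size
    if pad:
        start = n - remainder - pad
        if start >= 0:
            # Borrow the tail of the neighboring (previous) patch.
            borrowed = order[start:n - remainder]
        else:
            # Fewer points than one patch: repeat existing indices.
            borrowed = np.resize(order, pad)
        order = np.concatenate([order, borrowed])
    return order.reshape(-1, patch_size)


rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(10, 3))
patches = group_into_patches(points, grid_size=0.5, patch_size=4)
```

With 10 points and a patch size of 4, the result has three patches, and the last one holds the two leftover points plus two indices borrowed from the patch before it, so every point still appears at least once.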