LPFormer: LiDAR Pose Estimation Transformer with Multi-Task Network

Dongqiangzi Ye, Yufei Xie, Weijia Chen, Zixiang Zhou, Lingting Ge, Hassan Foroosh

TL;DR

LPFormer consists of two stages: firstly, it identifies the human bounding box and extracts multi-level feature representations, and secondly, it utilizes a transformer-based network to predict human keypoints based on these features.

Abstract

Because large-scale 3D human keypoint annotations are difficult to acquire, previous methods for 3D human pose estimation (HPE) have often relied on 2D image features and sequential 2D annotations. Furthermore, the training of these networks typically assumes the prediction of a human bounding box and the accurate alignment of 3D point clouds with 2D images, making direct application in real-world scenarios challenging. In this paper, we present the first framework for end-to-end 3D human pose estimation, named LPFormer, which uses only LiDAR as its input along with its corresponding 3D annotations. LPFormer consists of two stages: first, it identifies the human bounding box and extracts multi-level feature representations; second, it uses a transformer-based network to predict human keypoints from these features. Our method demonstrates that 3D HPE can be seamlessly integrated into a strong LiDAR perception network and benefit from the features extracted by the network. Experimental results on the Waymo Open Dataset demonstrate state-of-the-art performance, with improvements even over previous multi-modal solutions.

Paper Structure (14 sections, 3 equations, 5 figures, 5 tables)

This paper contains 14 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our method can predict 3D keypoints (red points with yellow wireframes), 3D bounding boxes, and 3D semantic segmentation in a single framework.
  • Figure 2: Main Architecture of LPFormer. Our network aims to estimate the 3D human pose for the entire frame based on the LiDAR-only input. It comprises two main components. The left part (blue) represents our powerful multi-task network, LidarMultiNet (Ye et al., 2022), which generates accurate 3D object detection and provides rich voxel and bird's-eye-view (BEV) features. The right part (green) corresponds to our Keypoint Transformer (KPTR), which predicts the 3D keypoints of each human box using various inputs from our first-stage network.
  • Figure 3: Illustration of Keypoint Transformer (KPTR). In the initial stage of our KPTR, we start by compressing the feature dimension of the box features. These compressed box features are then repeated and concatenated with the point features and point voxel features. The keypoint queries are generated from learnable embedding features. Then $L$ sequences of KPTR operations are performed on the keypoint queries and point tokens. Finally, the keypoint queries are passed through three distinct MLPs to learn the XY offsets, the Z offsets, and the visibilities of the 3D keypoints. Simultaneously, the point tokens are processed by an MLP to learn the point-wise segmentation labels for the 3D keypoints, which serves as an auxiliary task.
  • Figure 4: Prediction results on the whole scene with a significant number of pedestrians in the validation set.
  • Figure 5: Prediction results compared to the Ground Truth and the 1st stage results.
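
The data flow described in the Figure 3 caption can be sketched at the level of tensor shapes. The following is a minimal NumPy illustration, not the authors' implementation: all weights are random and untrained, the function names (`linear`, `attention_layer`, `kptr_forward`) are hypothetical, and every dimension (14 keypoints, feature widths, 2 layers, 15 segmentation classes) is an assumption chosen for the example. The "$L$ sequences of KPTR operations" are modeled here as plain single-head self-attention over the joint sequence of keypoint queries and point tokens, which is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, d_out):
    """Random, untrained projection standing in for a learned linear layer."""
    w = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return x @ w

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(tokens):
    """Single-head self-attention with a residual connection (assumed layer form)."""
    d = tokens.shape[-1]
    q, k, v = linear(tokens, d), linear(tokens, d), linear(tokens, d)
    weights = softmax(q @ k.T / np.sqrt(d))
    return tokens + weights @ v

def kptr_forward(point_feats, voxel_feats, box_feats, num_keypoints=14,
                 d_box=32, d_model=64, num_layers=2, num_seg_classes=15):
    """Shape-level sketch for the points inside one human box.

    point_feats: (N, Cp) per-point features
    voxel_feats: (N, Cv) voxel features gathered per point
    box_feats:   (Cb,)   features of the detected box
    All sizes and defaults are illustrative assumptions.
    """
    n_points = point_feats.shape[0]
    # Compress the box features, then repeat them once per point.
    box_small = linear(box_feats, d_box)
    box_rep = np.tile(box_small, (n_points, 1))
    # Concatenate with point and point-voxel features to form point tokens.
    fused = np.concatenate([point_feats, voxel_feats, box_rep], axis=1)
    point_tokens = linear(fused, d_model)
    # Keypoint queries come from a learnable embedding (random here).
    queries = rng.standard_normal((num_keypoints, d_model))
    # Apply L layers to the joint sequence of queries and point tokens.
    tokens = np.concatenate([queries, point_tokens], axis=0)
    for _ in range(num_layers):
        tokens = attention_layer(tokens)
    queries, point_tokens = tokens[:num_keypoints], tokens[num_keypoints:]
    # Three separate heads on the queries, one auxiliary head on the points.
    return {
        "xy_offsets": linear(queries, 2),
        "z_offsets": linear(queries, 1),
        "visibility": linear(queries, 1),
        "point_seg_logits": linear(point_tokens, num_seg_classes),
    }

# Example call with 50 points in one box (all feature widths are made up).
out = kptr_forward(rng.standard_normal((50, 3)),
                   rng.standard_normal((50, 16)),
                   rng.standard_normal(128))
for name, value in out.items():
    print(name, value.shape)
```

The sketch only shows how the inputs are fused and which outputs each head produces; the real model would replace the single projection per head with an MLP and train all parameters end to end.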