Inter-Frame Compression for Dynamic Point Cloud Geometry Coding

Anique Akhtar, Zhu Li, Geert Van der Auwera

TL;DR

A deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression that predicts the latent representation of the current frame using the previous frame by employing a novel feature space inter-prediction network.

Abstract

Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. This paper proposes a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame using the previous frame by employing a novel feature space inter-prediction network. The proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. The proposed method introduces a novel predictor network for motion compensation in the feature domain to map the latent representation of the previous frame to the coordinates of the current frame to predict the current frame's feature embedding. The framework transmits the residual of the predicted features and the actual features by compressing them using a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. The proposed framework is compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). The proposed method achieves more than 88% BD-Rate (Bjontegaard Delta Rate) reduction against G-PCCv20 Octree, more than 56% BD-Rate savings against G-PCCv20 Trisoup, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC. These significant performance gains are cross-checked and verified in the MPEG working group.
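The residual step described in the abstract can be illustrated with a minimal sketch. It assumes plain rounding as the quantizer and omits the learned factorized entropy model and arithmetic coding; the function names `encode_residual` and `decode_features` are hypothetical and are not taken from the paper.

```python
from typing import List

Features = List[List[float]]  # one feature vector per downsampled point


def encode_residual(actual: Features, predicted: Features) -> List[List[int]]:
    """Quantize the difference between actual and predicted features.

    Assumption: rounding to the nearest integer stands in for the
    quantizer; the entropy coding stage is left out entirely.
    """
    return [
        [round(a - p) for a, p in zip(f_actual, f_pred)]
        for f_actual, f_pred in zip(actual, predicted)
    ]


def decode_features(predicted: Features, residual_q: List[List[int]]) -> Features:
    """Rebuild the features at the receiver from the prediction and the residual."""
    return [
        [p + r for p, r in zip(f_pred, f_res)]
        for f_pred, f_res in zip(predicted, residual_q)
    ]


# Toy example: one point with two feature channels.
actual = [[2.0, -1.0]]
predicted = [[0.75, -0.25]]
residual_q = encode_residual(actual, predicted)
reconstructed = decode_features(predicted, residual_q)
```

Because the decoder can form the same prediction from the previously decoded frame, only the quantized residual has to be transmitted. Under this rounding assumption the reconstruction error per channel is at most 0.5.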

Paper Structure (23 sections, 6 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: System Model. The previously decoded frame $\widetilde{P^1}$ is employed to encode a feature embedding of the current frame $P^2$. Multiscale features from $\widetilde{P^1}$ and three-times downsampled coordinates $C^2_{3ds}$ from $P^2$ are passed to the Predictor network to learn a feature embedding $\widehat{P^2_{3ds}} = \{C^2_{3ds}, \widehat{F^2_{3ds}}\}$. The current frame's three-times downsampled coordinates $C^2_{3ds}$ are transmitted losslessly using an octree encoder. The predicted downsampled features $\widehat{F^2_{3ds}}$ are subtracted from the original downsampled features $F^2_{3ds}$ to obtain the residual features $R^2_{3ds}$. The residual is transmitted in a lossy manner using a learned entropy model. The same Encoder and Predictor modules are used throughout the system. Q, AE, and AD stand for quantization, arithmetic encoder, and arithmetic decoder, respectively.
  • Figure 2: Encoder and Decoder Network. The encoder network takes the original point cloud sparse tensor $P$ and creates sparse features at four different scales: $P_{0ds}$, $P_{1ds}$, $P_{2ds}$, and $P_{3ds}$. Here, $P_{3ds}$ denotes the three-times downsampled sparse tensor, which contains both the coordinates $C_{3ds}$ and their respective features $F_{3ds}$. The decoder network takes the three-times downsampled sparse tensor and hierarchically reconstructs the original point cloud by progressive rescaling. The decoder upsamples the sparse tensor one scale at a time using transpose convolution, followed by classification and pruning to remove the false voxels.
  • Figure 3: Example of classification and pruning layer with input sparse tensor $P_a$ and output sparse tensor $P_c$. Binary classification is applied to $P_b$ to choose the top voxels and prune false voxels from $P_a$ to obtain $P_c$.
  • Figure 4: Prediction network. The network takes four multiscale features from the previous frame and the three-times downsampled coordinates of the current frame $(C^2_{3ds})$ to learn the current frame's feature embedding $\widehat{P^2_{3ds}}$.
  • Figure 5: Comparison between the two generalized sparse convolutions employed in the proposed framework. Shown in 2D with blue as the output coordinates ($C^{\text{out}}$) and green as the input coordinates ($C^{\text{in}}$).
  • ...and 3 more figures
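
The classification and pruning step in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming that each candidate voxel produced by upsampling already carries an occupancy score and that the number of voxels to keep, `k`, is known; the name `prune_top_k` and the list-based representation are hypothetical and do not come from the paper, which operates on sparse tensors.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def prune_top_k(coords: List[Coord], scores: List[float], k: int) -> List[Coord]:
    """Keep the k candidate voxels with the highest occupancy scores.

    Assumption: `scores` are the outputs of a binary occupancy classifier,
    one per candidate voxel. Voxels outside the top k are treated as false
    voxels and pruned. The original ordering of the kept voxels is preserved.
    """
    if len(coords) != len(scores):
        raise ValueError("coords and scores must have the same length")
    k = max(0, min(k, len(coords)))
    # Rank candidate indices from the highest score to the lowest.
    ranked = sorted(range(len(coords)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [coords[i] for i in keep]


# Toy example: four candidate voxels after upsampling, two of them occupied.
candidates = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
occupancy = [0.9, 0.1, 0.8, 0.3]
kept = prune_top_k(candidates, occupancy, k=2)
```

Selecting a fixed number of top-scoring voxels, rather than thresholding the scores, keeps the size of the output predictable at each scale; which of the two the decoder uses in practice is not stated in the captions above.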