TSC-PCAC: Voxel Transformer and Sparse Convolution Based Point Cloud Attribute Compression for 3D Broadcasting

Zixi Guo, Yun Zhang, Linwei Zhu, Hanli Wang, Gangyi Jiang

TL;DR

The paper tackles the heavy bitrate burden of point cloud attributes in 3D broadcasting by proposing TSC-PCAC, an end-to-end framework that combines a voxel Transformer with a sparse-convolution-based variational autoencoder and a TSCM-based channel context module. It introduces a two-stage TSCM that captures both local and global interpoint dependencies, and a channel-wise context mechanism that exploits interchannel correlations to improve entropy modeling. Experiments show average BD-rate reductions of 38.53%, 21.30%, and 11.19% versus Sparse-PCAC, NF-PCAC, and G-PCC v23, respectively, with encoding/decoding time reduced by up to 97.68%/98.78% versus Sparse-PCAC. The source code and trained models are publicly available for reproducible evaluation.

Abstract

Point clouds have become the mainstream representation for advanced 3D applications such as virtual reality and augmented reality. However, the massive data volume of point clouds is one of the most challenging issues for transmission and storage. In this paper, we propose an end-to-end voxel Transformer and Sparse Convolution based Point Cloud Attribute Compression (TSC-PCAC) method for 3D broadcasting. Firstly, we present the TSC-PCAC framework, which includes a Transformer and Sparse Convolutional Module (TSCM) based variational autoencoder and a channel context module. Secondly, we propose a two-stage TSCM, where the first stage models local dependencies and feature representations of the point clouds, and the second stage captures global features through spatial and channel pooling, which cover larger receptive fields. This module extracts both global and local interpoint relevance to reduce informational redundancy. Thirdly, we design a TSCM-based channel context module to exploit interchannel correlations, which improves the predicted probability distribution of the quantized latent representations and thus reduces the bitrate. Experimental results indicate that the proposed TSC-PCAC method achieves average Bjøntegaard Delta bitrate reductions of 38.53%, 21.30%, and 11.19% compared to the Sparse-PCAC, NF-PCAC, and G-PCC v23 methods, respectively. Encoding/decoding time is reduced by up to 97.68%/98.78% on average compared to Sparse-PCAC. The source code and the trained models of TSC-PCAC are available at https://github.com/igizuxo/TSC-PCAC.
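
The link between a better predicted probability distribution and a lower bitrate can be illustrated with a small sketch. It assumes a discretized Gaussian entropy model, a common choice in learned compression that the abstract does not confirm for TSC-PCAC. The latent values, predicted means, and scales below are hypothetical and chosen only for illustration.

```python
import math

def gaussian_cdf(x, mean, scale):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def bits_for_symbol(q, mean, scale, floor=1e-9):
    """Estimated cost in bits of coding the integer q.

    The probability of q is the Gaussian mass on [q - 0.5, q + 0.5]
    (assumed entropy model, not taken from the paper).
    """
    p = gaussian_cdf(q + 0.5, mean, scale) - gaussian_cdf(q - 0.5, mean, scale)
    return -math.log2(max(p, floor))

def total_bits(latents, means, scales):
    """Sum of the per-symbol bit estimates."""
    return sum(bits_for_symbol(q, m, s) for q, m, s in zip(latents, means, scales))

# Hypothetical quantized latents of one channel group.
latents = [3, -2, 0, 5]

# Without context: a fixed zero-mean, wide prior for every symbol.
bits_without_context = total_bits(latents, [0.0] * 4, [3.0] * 4)

# With context: means and scales predicted from previously decoded channels
# (hypothetical values standing in for a context model's output).
bits_with_context = total_bits(latents, [2.8, -1.7, 0.2, 4.6], [0.8] * 4)

print(f"without context: {bits_without_context:.2f} bits")
print(f"with context:    {bits_with_context:.2f} bits")
```

When the predicted means sit close to the actual latent values and the predicted scales are narrow, each symbol receives more probability mass and therefore costs fewer bits, which is the effect the channel context module is designed to produce.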

Paper Structure (24 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Differences between dense and sparse convolution with kernel size $3\times3\times3$. (a) dense convolution, (b) sparse convolution.
  • Figure 2: Weight differences between convolution and local attention for voxels. (a) convolution in CNN. (b) local attention in Transformer, where $\mathbf{Q},\mathbf{K},\mathbf{V}$ denote three linear layers of query, key and value, respectively.
  • Figure 3: Framework of the proposed TSC-PCAC, where the green and yellow rectangles denote the proposed TSCM and the TSCM-based channel context model, respectively.
  • Figure 4: The structure of the TSCM and its key module. (a) TSCM, (b) Voxel-based Global Block. The 'split' operation performs channel-wise splitting. 'SP Avg Pooling' refers to spatial average pooling, which applies a global pooling operation across the spatial dimensions, and 'CH Avg Pooling' refers to channel-wise average pooling, which applies a global pooling operation across the channel dimension. 'sqrt' denotes the square root operation applied to the input.
  • Figure 5: Structure of the proposed TSCM-based channel context module. (a) channel context module, (b) TSCM-based transform.
  • ...and 3 more figures
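
The operations named in the captions of Figures 1 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes a sparse convolution that produces outputs only at occupied input voxels, single-channel features for the convolution, and an N x C list-of-lists layout for the pooling. All function names and data are hypothetical.

```python
from itertools import product

def sparse_conv3d(features, weights):
    """Sparse 3x3x3 convolution over occupied voxels only.

    features: dict mapping (x, y, z) -> float for occupied voxels.
    weights:  dict mapping kernel offset (dx, dy, dz) -> float.
    Empty neighbors are skipped, and outputs exist only where inputs
    exist (an assumed variant; a dense convolution would visit every
    cell of the bounding grid instead).
    """
    out = {}
    for (x, y, z) in features:
        acc = 0.0
        for (dx, dy, dz), w in weights.items():
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in features:
                acc += w * features[neighbor]
        out[(x, y, z)] = acc
    return out

def spatial_avg_pool(rows):
    """Average over all voxels: N x C features -> one value per channel."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def channel_avg_pool(rows):
    """Average over all channels: N x C features -> one value per voxel."""
    return [sum(row) / len(row) for row in rows]

# All-ones 3x3x3 kernel, so each output is the sum of occupied neighbors.
weights = {offset: 1.0 for offset in product((-1, 0, 1), repeat=3)}

# Three occupied voxels: two adjacent, one isolated.
features = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (5, 5, 5): 4.0}
conv_out = sparse_conv3d(features, weights)

# Three voxels with two channels each.
rows = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]
per_channel = spatial_avg_pool(rows)
per_voxel = channel_avg_pool(rows)
```

The sparse variant touches only the three occupied voxels, whereas a dense convolution over the same 6 x 6 x 6 bounding grid would evaluate 216 positions. The two pooling functions differ only in the axis they reduce: one summarizes the whole point cloud per channel, the other summarizes all channels per voxel.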