Towards Point Cloud Compression for Machine Perception: A Simple and Strong Baseline by Learning the Octree Depth Level Predictor

Lei Liu, Zhihao Hu, Zhenghao Chen

TL;DR

This paper tackles the problem of compressing point clouds for both human and machine perception by introducing PCCMP-Net, a scalable coding baseline that partitions the bit-stream and adaptively selects octree depth levels per machine-vision task. The method integrates with mainstream octree codecs (e.g., VoxelContext-Net, OctAttention, G-PCC) and uses an octree depth level predictor trained with Gumbel-Softmax to allocate bits where they most improve classification, segmentation, or detection, while preserving the full bit-stream for human vision. Key contributions include (1) a simple, strong baseline for joint machine and human vision compression, (2) a bit-stream partitioning mechanism compatible with existing codecs, and (3) comprehensive experiments on ModelNet10/40, ShapeNet, ScanNet, and KITTI showing substantial machine-vision gains with no degradation in human-vision quality. The approach demonstrates meaningful bandwidth savings and accuracy improvements across multiple tasks, providing a practical, extensible framework to guide future research in point-cloud compression for machine perception. The work highlights the value of task-aware bit allocation in 3D data, enabling more efficient deployment in real-world systems where bandwidth and processing constraints are critical.
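To make the Gumbel-Softmax step more concrete, the sketch below draws a relaxed one-hot choice over candidate octree depth levels. It is a minimal illustration under assumptions, not the authors' implementation: the function names, the logits, and the temperature `tau` are all hypothetical, and in PCCMP-Net the logits would come from the learned predictor rather than being hand-set.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over depth levels (entries sum to 1).

    `logits` are hypothetical per-depth scores; `tau` is an assumed temperature.
    """
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(probs):
    """Index of the selected depth level (argmax of the relaxed sample)."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    sample = gumbel_softmax([0.2, 1.5, 0.7], tau=0.5, rng=random.Random(0))
    print(sample, hard_choice(sample))
```

A lower `tau` pushes the sample closer to a one-hot vector, while the softmax keeps the selection differentiable with respect to the logits, which is what allows a discrete depth choice to be trained end to end.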

Abstract

Point cloud compression has garnered significant interest in computer vision. However, existing algorithms primarily cater to human vision, while most point cloud data is utilized for machine vision tasks. To address this, we propose a point cloud compression framework that simultaneously handles both human and machine vision tasks. Our framework learns a scalable bit-stream, using only subsets for different machine vision tasks to save bit-rate, while employing the entire bit-stream for human vision tasks. Building on mainstream octree-based frameworks like VoxelContext-Net, OctAttention, and G-PCC, we introduce a new octree depth-level predictor. This predictor adaptively determines the optimal depth level for each octree constructed from a point cloud, controlling the bit-rate for machine vision tasks. For simpler tasks (e.g., classification) or objects/scenarios, we use fewer depth levels with fewer bits, saving bit-rate. Conversely, for more complex tasks (e.g., segmentation) or objects/scenarios, we use deeper depth levels with more bits to enhance performance. Experimental results on various datasets (e.g., ModelNet10, ModelNet40, ShapeNet, ScanNet, and KITTI) show that our point cloud compression approach improves performance for machine vision tasks without compromising human vision quality.
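The trade-off between depth level and bit-rate can be illustrated with a toy octree. The sketch below assumes points normalized to the unit cube and counts one raw 8-bit occupancy code per occupied node, with no entropy coding; the helper names are hypothetical, and real codecs such as G-PCC compress these codes further.

```python
def occupied_cells(points, depth):
    """Integer cell indices occupied at `depth`, for points in [0, 1)^3."""
    n = 1 << depth  # cells per axis at this depth
    return {tuple(min(int(c * n), n - 1) for c in p) for p in points}


def reconstruct(points, depth):
    """Decode to cell centers: the point cloud as seen at `depth`."""
    n = 1 << depth
    cells = occupied_cells(points, depth)
    return sorted(tuple((i + 0.5) / n for i in cell) for cell in cells)


def raw_occupancy_bits(points, depth):
    """Uncompressed cost of reaching `depth`: one 8-bit occupancy code
    for every occupied node at levels 0 .. depth - 1."""
    return sum(8 * len(occupied_cells(points, level)) for level in range(depth))


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (0.9, 0.9, 0.9)]
    for d in (1, 2, 3):
        print(d, raw_occupancy_bits(pts, d), reconstruct(pts, d))
```

In this toy setting, stopping at a shallower depth merges nearby points into one coarse cell and spends fewer bits, while each additional level refines the geometry at a higher cost. This is the quantity a depth-level predictor would trade against task accuracy.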

Paper Structure (15 sections, 4 equations, 5 figures)


Figures (5)

  • Figure 1: Classification results of a pre-trained PointNet++ (Qi et al., 2017) model on point clouds reconstructed from different octree depth levels. "raw" denotes the original point cloud. The correct or incorrect prediction for each input is shown beneath the corresponding point cloud.
  • Figure 2: (a) The octree-based encoding and decoding process. The bit-stream $\mathrm{b_1}$ is used for the first machine vision task, the bit-stream $\mathrm{b_1} \cup \mathrm{b_2}$ is used for the second machine vision task, and $\mathrm{b_1} \cup \mathrm{b_2} \cup \cdots \cup \mathrm{b_n}$ is the full bit-stream used for human vision. (b) The overall network architecture of our PCCMP-Net. (c) Details of our proposed octree depth level (ODL) predictor.
  • Figure 3: Results in (a-c) are for a single machine vision task (i.e., classification) on the ModelNet10 and ModelNet40 datasets. The multi-task results in (f) and (i) are for the classification and segmentation tasks on the ShapeNet dataset. The multi-task results in (d), (e), (g), (h), and (j-l) are for the segmentation and detection tasks on the ScanNet and KITTI datasets. "Ours (VoxelContext-Net)", "Ours (OctAttention)", and "Ours (G-PCC)" denote PCCMP-Net with VoxelContext-Net, OctAttention, and G-PCC, respectively, as the encoder and decoder. The results of PointNet++, VoteNet, and PointRCNN are obtained with the raw (uncompressed) point cloud as input.
  • Figure 4: Qualitative results for the segmentation task on the ShapeNet dataset (a, b) and for the detection task on the ScanNet dataset (c, d).
  • Figure 5: The selection percentage of different octree depth levels at different bpp values for the classification task on the ModelNet40 dataset (a), and the detection task on the ScanNet dataset (b) and the KITTI dataset (c). Different colors represent different depth levels.
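
The nested bit-streams in Figure 2(a) can be pictured as prefixes of one scalable stream. The sketch below is only an illustration under assumptions: it treats the encoded data as one byte chunk per octree level and uses hypothetical depth cutoffs, whereas a real codec would also have to signal the segment boundaries to the decoder.

```python
def partition(level_chunks, cutoffs):
    """Group per-level chunks into segments b_1 .. b_n.

    `cutoffs` are assumed cumulative depth levels, one per machine vision
    task; the final segment holds the remaining levels for human vision.
    """
    segments, start = [], 0
    for end in list(cutoffs) + [len(level_chunks)]:
        segments.append(b"".join(level_chunks[start:end]))
        start = end
    return segments


def prefix(segments, k):
    """Bytes a consumer of b_1 ∪ ... ∪ b_k would read."""
    return b"".join(segments[:k])


if __name__ == "__main__":
    chunks = [b"a", b"bb", b"ccc", b"dddd"]  # placeholder payload per level
    segs = partition(chunks, [1, 3])
    print(prefix(segs, 1))          # first machine vision task
    print(prefix(segs, 2))          # second machine vision task
    print(prefix(segs, len(segs)))  # full bit-stream for human vision
```

Because every task reads a prefix of the same stream, the data is encoded once and no task-specific re-encoding is needed in this simplified picture.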