Towards Point Cloud Compression for Machine Perception: A Simple and Strong Baseline by Learning the Octree Depth Level Predictor
Lei Liu, Zhihao Hu, Zhenghao Chen
TL;DR
This paper addresses point cloud compression for both human and machine perception with PCCMP-Net, a scalable coding baseline that partitions the bit-stream and adaptively selects an octree depth level for each machine-vision task. The method integrates with mainstream octree codecs (e.g., VoxelContext-Net, OctAttention, G-PCC) and uses an octree depth level predictor, trained with Gumbel-Softmax, to spend bits where they most improve classification, segmentation, or detection, while the full bit-stream remains available for human vision. Key contributions are (1) a simple, strong baseline for joint machine and human vision compression, (2) a bit-stream partitioning mechanism compatible with existing codecs, and (3) experiments on ModelNet10/40, ShapeNet, ScanNet, and KITTI that report machine-vision gains with no degradation in human-vision quality. By allocating bits according to the task, the approach targets real-world deployments where bandwidth and processing budgets are tight.
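To make the Gumbel-Softmax step concrete, here is a minimal forward-only sketch of how a discrete depth level could be chosen from predictor scores. It is an illustration under stated assumptions, not the paper's implementation: the function names `gumbel_softmax` and `select_depth`, the temperature `tau`, and the candidate depth list are all hypothetical, and a real predictor would use an autodiff framework so that gradients flow through the relaxed sample.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return a relaxed one-hot sample over the given logits.

    Gumbel(0, 1) noise is added to each logit, the result is divided by the
    temperature tau, and a numerically stable softmax is applied.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_depth(logits, candidate_depths, tau=1.0, rng=None):
    """Pick the candidate octree depth with the largest relaxed probability."""
    probs = gumbel_softmax(logits, tau=tau, rng=rng)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidate_depths[best], probs


if __name__ == "__main__":
    # Hypothetical scores for three candidate depth levels.
    depth, probs = select_depth([0.2, 1.5, 0.1], [6, 8, 10], tau=0.5,
                                rng=random.Random(0))
    print(depth, [round(p, 3) for p in probs])
```

Lower values of `tau` push the sample closer to a hard one-hot choice, while higher values keep it smoother; the setting used here is arbitrary.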
Abstract
Point cloud compression has garnered significant interest in computer vision. However, existing algorithms primarily cater to human vision, while most point cloud data is used for machine vision tasks. To address this gap, we propose a point cloud compression framework that handles both human and machine vision tasks simultaneously. Our framework learns a scalable bit-stream, using only subsets of it for different machine vision tasks to save bit-rate, while employing the entire bit-stream for human vision tasks. Building on mainstream octree-based frameworks such as VoxelContext-Net, OctAttention, and G-PCC, we introduce a new octree depth level predictor. This predictor adaptively determines the optimal depth level for each octree constructed from a point cloud, thereby controlling the bit-rate for machine vision tasks. For simpler tasks (e.g., classification) or simpler objects/scenarios, we use fewer depth levels and therefore fewer bits, saving bit-rate. Conversely, for more complex tasks (e.g., segmentation) or more complex objects/scenarios, we use more depth levels and more bits to enhance performance. Experimental results on various datasets (e.g., ModelNet10, ModelNet40, ShapeNet, ScanNet, and KITTI) show that our point cloud compression approach improves performance for machine vision tasks without compromising human vision quality.
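The idea of a scalable octree bit-stream, where a machine-vision task decodes only the first few depth levels and human vision decodes all of them, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's codec: points are assumed to lie in the unit cube, each node is stored as a raw 8-bit occupancy code with no entropy coding, and the names `quantize`, `encode_levels`, `decode_levels`, and `raw_bits` are hypothetical.

```python
def quantize(points, depth):
    """Map points in the unit cube [0, 1) to integer voxel coordinates."""
    n = 1 << depth
    return {tuple(min(int(c * n), n - 1) for c in p) for p in points}


def encode_levels(points, max_depth):
    """Return one list of 8-bit occupancy codes per octree level.

    levels[d] holds the codes of the occupied nodes at depth d, ordered by
    sorted node coordinate, so any prefix levels[:k] is decodable on its own.
    """
    leaves = quantize(points, max_depth)
    levels = []
    for d in range(max_depth):
        shift = max_depth - (d + 1)
        children = {(x >> shift, y >> shift, z >> shift) for x, y, z in leaves}
        codes = {}
        for cx, cy, cz in children:
            parent = (cx >> 1, cy >> 1, cz >> 1)
            bit = ((cx & 1) << 2) | ((cy & 1) << 1) | (cz & 1)
            codes[parent] = codes.get(parent, 0) | (1 << bit)
        levels.append([codes[key] for key in sorted(codes)])
    return levels


def decode_levels(levels, depth):
    """Decode only the first `depth` levels into occupied voxel coordinates."""
    nodes = [(0, 0, 0)]
    for codes in levels[:depth]:
        children = []
        for (x, y, z), code in zip(nodes, codes):
            for bit in range(8):
                if code & (1 << bit):
                    children.append((2 * x + (bit >> 2),
                                     2 * y + ((bit >> 1) & 1),
                                     2 * z + (bit & 1)))
        nodes = sorted(children)
    return set(nodes)


def raw_bits(levels, depth):
    """Bits needed for the first `depth` levels at 8 bits per occupancy code."""
    return 8 * sum(len(codes) for codes in levels[:depth])


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (0.9, 0.9, 0.9)]
    stream = encode_levels(cloud, max_depth=4)
    for k in (2, 4):  # shallow prefix for a simple task, full stream for viewing
        print(k, raw_bits(stream, k), sorted(decode_levels(stream, k)))
```

Decoding a shorter prefix yields a coarser voxel grid at a lower bit cost, which is the trade-off a depth level predictor would be choosing between for each octree.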
