HINT: Hierarchical Inter-frame Correlation for One-shot Point Cloud Sequence Compression

Yuchen Gao, Qi Zhang

TL;DR

Problem: efficient, low-latency lossless compression of sequential point clouds. Approach: HINT uses a hierarchical sparse representation and a temporal-spatial entropy model that fuses a two-stage temporal feature extraction (parent-level existence map and child-level neighborhood lookup) with parity-based, sibling-conditioned entropy coding, predicting occupancy distributions for groups of voxels. Contributions: a multi-level sparse pyramid $(\mathcal{C}_d,\mathcal{O}_d)$, coarse-to-fine temporal cues, group-wise context, and strict causality that preserves parallelizable decoding. Results: on 8iVFBv2, HINT achieves encoding/decoding times of 105 ms/140 ms and up to 43.6% bitrate reduction versus G-PCC, with robust gains over spatial baselines. Impact: enables practical, GPU-friendly dynamic point-cloud compression for streaming and storage without heavy motion estimation.

Abstract

Deep learning has demonstrated strong capability in compressing point clouds. Within this area, entropy modeling for lossless compression is widely investigated. However, most methods rely solely on parent/sibling contexts and level-wise autoregression, and therefore suffer from decoding latency on the order of $10^1$ to $10^2$ seconds. We propose HINT, a method that integrates temporal and spatial correlation for sequential point cloud compression. Specifically, it first performs a two-stage temporal feature extraction: (i) a parent-level existence map and (ii) a child-level neighborhood lookup in the previous frame. These cues are fused with the spatial features via element-wise addition and encoded with a group-wise strategy. Experimental results show that HINT achieves encoding and decoding times of 105 ms and 140 ms, respectively, equivalent to 49.6x and 21.6x acceleration over G-PCC, while achieving up to 43.6% bitrate reduction and consistently outperforming the spatial-only baseline (RENO).
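To make the parent-level existence map more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes occupied voxels are stored as Python sets of integer `(x, y, z)` tuples and uses a 3x3x3 window (27 neighbors, matching the example size given for Figure 3). The names `OFFSETS`, `existence_map`, and `temporal_context` are hypothetical. A real system would likely use sparse-tensor operations on the GPU instead of set lookups.

```python
from itertools import product

# 3x3x3 neighborhood offsets, center included (27 in total).
# Assumption: the window is a cube of side 3 around the query voxel.
OFFSETS = list(product((-1, 0, 1), repeat=3))


def existence_map(voxel, occupied):
    """Return 27 binary flags: which neighbors of `voxel` are in `occupied`."""
    x, y, z = voxel
    return [int((x + dx, y + dy, z + dz) in occupied) for dx, dy, dz in OFFSETS]


def temporal_context(voxel, occupied_cur, occupied_prev):
    """Concatenate the lookups in frame t and frame t-1 (54 flags).

    Assumption: the level-d occupancy of frame t is already known when
    this context is built, so using it does not break causality.
    """
    return existence_map(voxel, occupied_cur) + existence_map(voxel, occupied_prev)


# Hypothetical toy frames at one pyramid level.
frame_prev = {(1, 1, 1), (2, 1, 1)}
frame_cur = {(1, 1, 1)}
context = temporal_context((1, 1, 1), frame_cur, frame_prev)
```

In this sketch, each parent-level voxel gets a fixed-length binary vector that a small network could turn into a temporal feature. No motion estimation is involved: the lookup only reads the previous frame at the same coordinates.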

Paper Structure

This paper contains 9 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Rate-latency Pareto on MPEG 8iVFBv2.
  • Figure 2: Overview of HINT. Frames $t$ and $t-1$ are processed by the Temporal Module to produce a parent-level feature $T_{d}$ and a child-level feature $T_{d+1}$. The RENO path provides the spatial feature $F_s$ from frame $t$ (Prior ResNet at parents). The parent-level features are fused by element-wise addition and broadcast to the child level, where the result is fused with $T_{d+1}$ to form the final feature. Each parent's eight children are split by parity into two groups, $G_e$ and $G_o$. The probability of $G_e$ is predicted first from the final feature. The ground-truth $G_e$ is then embedded and aggregated with the final feature as additional information to predict the probability of $G_o$. The predicted probabilities are passed to an entropy encoder to produce the bitstream.
  • Figure 3: Temporal context and grouping module. a) Temporal (coarse) module. For each parent-level voxel at time $t$, we collect a neighborhood of size $V_d$ (e.g., $V_d = 27$) in frames $t$ and $t-1$ and concatenate the occupancy retrieved from the two frames. b) Temporal (fine) module. For each child-level voxel in frame $t$, we query a window at the corresponding location in frame $t-1$ and feed it to an MLP to obtain a per-child temporal feature. c) Grouping. The children of each parent are split into odd and even groups.
  • Figure 4: BPP comparison on MPEG 8iVFBv2 (vox10).
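
The parity grouping described in the Figure 2 and Figure 3 captions can be illustrated with a short sketch. The captions do not say how parity is defined, so this example assumes a checkerboard rule: a child at offset `(x, y, z)` inside its parent belongs to the even group when `x + y + z` is even. The bit layout (child index `i = 4x + 2y + z`, bit `i` of an 8-bit occupancy code) and the function names are likewise assumptions for illustration, not the paper's specification.

```python
def child_offsets():
    """Offsets (x, y, z) of a parent's eight children, indexed i = 4x + 2y + z."""
    return [((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]


def split_by_parity(occupancy_code):
    """Split an 8-bit occupancy code into even and odd child groups.

    Assumption: parity is (x + y + z) % 2, a checkerboard pattern.
    Returns two lists of (child_index, occupancy_bit) pairs.
    """
    even, odd = [], []
    for i, (x, y, z) in enumerate(child_offsets()):
        bit = (occupancy_code >> i) & 1
        if (x + y + z) % 2 == 0:
            even.append((i, bit))
        else:
            odd.append((i, bit))
    return even, odd


# Children 0 and 7 occupied (hypothetical example code).
even_group, odd_group = split_by_parity(0b10000001)
```

Under this assumed rule, each group holds four children. A coder would first code the even group from the fused feature, then code the odd group conditioned on the even group's true occupancy. This gives two coding passes per level instead of eight sequential ones.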