Bidirectional Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression

Xuan Deng, Xingtao Wang, Xiandong Meng, Longguang Wang, Tiange Zhang, Xiaopeng Fan, Debin Zhao

TL;DR

The paper proposes a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space, together with a Random Access (RA) reference strategy that treats the bidirectionally aligned features as conditional context, enabling frame-level parallel compression and eliminating sequential encoding.

Abstract

Efficient dynamic point cloud compression (DPCC) critically depends on accurate motion estimation and compensation. However, the inherently irregular structure and substantial local variations of point clouds make this task highly challenging. Existing approaches typically rely on explicit motion estimation, whose encoded motion vectors often fail to capture complex dynamics and inadequately exploit temporal correlations. To address these limitations, we propose a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space. Bi-FMT aligns features across both past and future frames to produce temporally consistent latent representations, which serve as predictive context in a conditional coding pipeline, forming a unified "Motion + Conditional" representation. Built upon this bidirectional feature alignment, we introduce a Cross-Transformer Refinement (CTR) module at the decoder side to adaptively refine locally aligned features. By modeling cross-frame dependencies with vector attention, CTR enhances local consistency and restores fine-grained spatial details that are often lost during motion alignment. Moreover, we design a Random Access (RA) reference strategy that treats the bidirectionally aligned features as conditional context, enabling frame-level parallel compression and eliminating sequential encoding. Extensive experiments demonstrate that Bi-FMT surpasses D-DPCC and AdaDPCC in both compression efficiency and runtime, achieving BD-Rate reductions of 20% (D1) and 9.4% (D1), respectively.
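
To make the idea of a Random Access reference order more concrete, the following is a minimal sketch of a hierarchical coding schedule in which every frame is predicted from one already-decoded past frame and one already-decoded future frame. The dyadic (midpoint) split, the assumption that both endpoint frames are decoded beforehand, and the function name `ra_schedule` are illustrative assumptions and are not taken from the paper.

```python
def ra_schedule(start, end):
    """Hypothetical dyadic random-access schedule (an assumption, not the paper's).

    Frames `start` and `end` are assumed to be decoded already. Returns a list
    of levels; each level is a list of (t, past_ref, future_ref) tuples.
    Frames in the same level reference only frames from earlier levels, so
    they do not depend on each other.
    """
    levels = []
    intervals = [(start, end)]
    while intervals:
        level, next_intervals = [], []
        for lo, hi in intervals:
            if hi - lo < 2:
                continue  # no frame strictly between the two references
            mid = (lo + hi) // 2
            level.append((mid, lo, hi))  # frame mid uses lo (past) and hi (future)
            next_intervals += [(lo, mid), (mid, hi)]
        if level:
            levels.append(level)
        intervals = next_intervals
    return levels


if __name__ == "__main__":
    for depth, level in enumerate(ra_schedule(0, 8), start=1):
        for t, past, future in level:
            # n and m correspond to the reference distances t-n and t+m.
            print(f"level {depth}: frame {t} <- past {past} (n={t - past}), "
                  f"future {future} (m={future - t})")
```

Under these assumptions, the frames within one level could be compressed in parallel, whereas a purely sequential scheme would have to process them one after another.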

Paper Structure

This paper contains 33 sections, 9 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: The framework employs a non-sequential hierarchical encoding mode. The input point cloud $X_t$ is downsampled via three sparse convolutions to produce $X^3_t$, consisting of coordinates $C^3_t$ and features $F^3_t$. These features are processed with reference point clouds $\hat{X}_{t-n}$ at time $t-n$ and $\hat{X}_{t+m}$ at time $t+m$ through a Bi-FMT module to generate temporally aligned features, which serve as temporal context for latent feature modeling. The downsampled coordinates $C^3_t$ are encoded using a learnable lossless codec; the coordinates $C^4_t$ extracted from the output $X^4_t$ of the Contextual Encoder are losslessly encoded using G-PCC, while the coordinate-related feature $F^4_t$ is encoded in a lossy manner using the Conditional Entropy Model and subsequently reconstructed as $\hat{F}^4_t$. The Contextual Decoder takes $X^4_t$ and the motion-aware context as input and outputs the decoded feature $\hat{F}^{3_{\text{aligned}}}_t$. This feature is further refined by the CTR module, yielding the refined representation $\hat{F}^{3_{\text{refined}}}_t$, which is then combined with the losslessly decoded coordinates $C^3_t$ to form the point cloud $\hat{X}^3_t$. Through three consecutive upsampling steps, $\hat{X}^3_t$ is reconstructed to the same scale as the original point cloud, resulting in $\hat{X}_t$. The decoded point cloud is then stored in either the forward or backward frame buffer.
  • Figure 2: Architectural details of the upsampling and downsampling modules.
  • Figure 3: Detailed Illustration of the Unidirectional Feature-aligned Motion Transformation for Spatiotemporal Alignment.
  • Figure 4: Illustration of the bi-directional weighting-based feature fusion.
  • Figure 5: Detailed Illustration of the Cross Transformer for Local Feature Refinement. $\text{AGG}$ denotes the $\odot$ operation, i.e., the channel-wise multiplication between the attention vector and the value feature, which enables feature-wise modulation. $\omega : \mathbb{R}^c \to \mathbb{R}^c$ is a learnable weight encoder (e.g., an MLP) that computes the attention vectors used to re-weight $\mathbf{v}_j$ across feature channels before aggregation.
  • ...and 9 more figures
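
The vector attention described in the Figure 5 caption can be illustrated with a small sketch. It assumes a subtraction relation between the query and each key, a two-layer ReLU MLP standing in for the weight encoder $\omega$, and a per-channel softmax over the neighbors; these choices, along with all function and parameter names, are illustrative assumptions and not details confirmed by the source.

```python
import math


def weight_encoder(x, w1, w2):
    """Stand-in for omega: R^c -> R^c, here a two-layer MLP with ReLU (assumed).

    w1 is a c x h matrix and w2 an h x c matrix, both given as lists of rows.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w1)]
    return [sum(hj * w for hj, w in zip(hidden, col)) for col in zip(*w2)]


def vector_attention(q, keys, values, w1, w2):
    """Refine one query feature q (length c) from n neighbor keys and values.

    Unlike scalar attention, each channel gets its own attention weight:
    the attention vector multiplies the value feature channel-wise (the
    "AGG" step) before the results are summed over the neighbors.
    """
    # Assumed relation: per-channel difference between the query and each key.
    logits = [weight_encoder([qi - ki for qi, ki in zip(q, k)], w1, w2)
              for k in keys]  # n rows of length c
    out = []
    for ch in range(len(q)):
        column = [row[ch] for row in logits]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        # Channel-wise re-weighting of v_j, then aggregation over neighbors j.
        out.append(sum(e / total * v[ch] for e, v in zip(exps, values)))
    return out


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]]
    values = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    w1 = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]    # c=2 rows, h=3 columns
    w2 = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.5]]  # h=3 rows, c=2 columns
    print(vector_attention(q, keys, values, w1, w2))
```

Because the softmax is taken separately for each channel, two channels of the same neighbor can receive different weights, which is what distinguishes vector attention from the scalar dot-product form.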