Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression

Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos

TL;DR

This work addresses the cost of transmitting 3D point clouds in robotic systems by introducing a semantic scene graph–driven compression framework. It decomposes LiDAR scans into semantically coherent patches, encodes each patch with a FiLM-conditioned transformer to produce a compact latent vector, and uses a folding-based decoder guided by graph attributes to reconstruct geometry and semantics with high fidelity. The approach achieves state-of-the-art compression (up to 98% data reduction) while preserving performance on downstream tasks such as pose graph optimization and map merging, even under strict bandwidth constraints. Models trained on SemanticKITTI are also evaluated on nuScenes without fine-tuning, and the results highlight the value of integrating relational structure into compression for robust, task-aware 3D data handling.

Abstract

Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, particularly given the growing reliance on edge and cloud-based processing. However, the size and complexity of point clouds create challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
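
To make the FiLM conditioning mentioned in the abstract concrete, the following is a minimal NumPy sketch of the general technique: a per-patch semantic embedding is mapped to a per-channel scale (gamma) and shift (beta) that modulate the point features. It is not the authors' implementation. The class count, embedding size, feature size, random weights, and the names `class_embedding`, `W_gamma`, `W_beta`, and `film` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
NUM_CLASSES = 20   # number of semantic labels
EMBED_DIM = 16     # size of the semantic embedding
FEAT_DIM = 32      # size of each per-point feature vector

# Random stand-ins for parameters that would normally be learned.
class_embedding = rng.normal(size=(NUM_CLASSES, EMBED_DIM))
W_gamma = rng.normal(scale=0.1, size=(EMBED_DIM, FEAT_DIM))
W_beta = rng.normal(scale=0.1, size=(EMBED_DIM, FEAT_DIM))


def film(point_features, class_id):
    """Scale and shift every feature channel based on the patch's class."""
    e = class_embedding[class_id]          # (EMBED_DIM,)
    gamma = 1.0 + e @ W_gamma              # (FEAT_DIM,), centred on identity scaling
    beta = e @ W_beta                      # (FEAT_DIM,)
    return gamma * point_features + beta   # broadcasts over the N points


# One patch of 128 points, each with a FEAT_DIM-dimensional feature.
patch_features = rng.normal(size=(128, FEAT_DIM))
modulated = film(patch_features, class_id=3)
```

The same point features produce different outputs for different class ids, which is the sense in which the encoder adapts its representation to semantic context.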

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview. A raw point cloud is first converted into a semantic scene graph (SSG), capturing object- and layer-level structure. The patch extractor (PE) then subdivides the scene into layer-specific patches, which are encoded by a transformer-based autoencoder into compact, per-patch latent vectors. These are later decoded to reconstruct the full point cloud. The proposed framework achieves extreme compression rates of up to 98%, with the encoded representation consisting solely of the scene graph and the set of latent vectors.
  • Figure 2: Semantic-aware Encoder. Overview of the proposed semantic-aware encoder, where each patch is processed alongside its semantic class. The right section illustrates the positional encoding module, which maps 3D coordinates to a high-dimensional space and projects them into the feature space. A FiLM module conditions point features using the semantic embedding of the patch, enabling the network to adapt representations based on semantic context. These features are then refined through a series of transformer blocks and pooled via spatial attention to produce a compact latent descriptor.
  • Figure 3: Scene graph-conditioned Decoder. Overview of the proposed decoder conditioned on the semantic scene graph attributes. Given the latent vector of a patch, the decoder generates a set of coarse points using the bounding box and learns per-point offsets. These coarse features are then upsampled via a folding operation guided by a fixed 2D grid, producing a dense reconstruction. A final confidence mask is predicted to prune low-quality outputs.
  • Figure 4: Compression results per codec. The left column presents results on SemanticKITTI (in-dataset evaluation), while the right column shows cross-dataset generalization on nuScenes, where models were trained on SemanticKITTI and applied directly to nuScenes without fine-tuning. Dashed lines indicate methods that do not encode semantic labels.
  • Figure 5: Qualitative comparisons. Qualitative results for the codecs that support semantic label encoding, on two SemanticKITTI scans (top: Sequence 00, bottom: Sequence 06), comparing the proposed method with baseline compression algorithms. At low bits-per-point, corresponding to a compression rate of 98%, our method achieves significantly better reconstruction quality in terms of geometric fidelity and scene completeness. For visual clarity, the ground has been removed in the zoomed-in views, which highlight fine-grained structures such as cars, infrastructure, tree trunks, poles, and signage.
  • ...and 1 more figures
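
The folding step described in the Figure 3 caption can be illustrated with a generic FoldingNet-style sketch in NumPy: each coarse point's feature is paired with every cell of a fixed 2D grid, a small MLP maps each pair to a 3D offset, and the offsets are added to the coarse point. This is a sketch under stated assumptions, not the paper's decoder. The layer sizes, random weights, and the names `make_grid` and `fold` are hypothetical, and the bounding-box conditioning and confidence mask are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_grid(side):
    """Fixed 2D grid in [-1, 1]^2 with side * side points."""
    u = np.linspace(-1.0, 1.0, side)
    uu, vv = np.meshgrid(u, u, indexing="ij")
    return np.stack([uu.ravel(), vv.ravel()], axis=1)    # (side*side, 2)


def fold(coarse_points, coarse_features, grid, W1, W2):
    """Upsample each coarse point into len(grid) dense points."""
    m, g = coarse_points.shape[0], grid.shape[0]
    feats = np.repeat(coarse_features, g, axis=0)        # (m*g, F), each row repeated g times
    grid_tiled = np.tile(grid, (m, 1))                   # (m*g, 2), whole grid repeated m times
    x = np.concatenate([feats, grid_tiled], axis=1)      # (m*g, F+2)
    hidden = np.maximum(x @ W1, 0.0)                     # ReLU layer
    offsets = hidden @ W2                                # (m*g, 3) per-point offsets
    return np.repeat(coarse_points, g, axis=0) + offsets


# Illustrative sizes (assumptions): 8 coarse points, 16-dim features, 4x4 grid.
M, F, SIDE, HIDDEN = 8, 16, 4, 32
coarse_points = rng.normal(size=(M, 3))
coarse_features = rng.normal(size=(M, F))
W1 = rng.normal(scale=0.1, size=(F + 2, HIDDEN))         # stand-in for learned weights
W2 = rng.normal(scale=0.1, size=(HIDDEN, 3))

grid = make_grid(SIDE)
dense = fold(coarse_points, coarse_features, grid, W1, W2)
```

With these sizes, 8 coarse points expand into 8 × 16 = 128 dense points. Because the grid is fixed and shared, only the coarse points and their features need to be derived from the latent vector.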