An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models

Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, Liang Xiao

TL;DR

This work targets robust 3D point-cloud semantic segmentation by reframing diffusion-based methods with a Conditional-Noise Framework (CNF). By making the Conditional Network (CN) the dominant backbone and treating the Noise Network (NN) as an auxiliary perturbation source, CDSegNet achieves strong noise and sparsity robustness while enabling single-step inference during deployment. The approach yields state-of-the-art results on indoor and outdoor benchmarks (e.g., ScanNet, ScanNet200, nuScenes) and demonstrates portability across backbones with minimal runtime overhead. The contribution lies in CNF design, the CDSegNet architecture, and extensive analyses that illuminate why diffusion-inspired noise perturbations can improve discriminative semantic segmentation in noisy 3D scenes.

Abstract

Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic Segmentation Network based on a Conditional-Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, improving generalization to unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single inference step like non-DDPMs, because its dominant network avoids directly fitting the scores from semantic labels. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.

Paper Structure

This paper contains 40 sections, 29 equations, 20 figures, 23 tables.

Figures (20)

  • Figure 1: The training and inference difference between NCF and CNF. NCF, dominated by NN, relies on the quality of noise fitting and therefore requires extensive training and inference iterations. In contrast, CNF focuses on CN, which alleviates the need for noise fitting while retaining the robustness of DDPMs.
  • Figure 1: Visualization of the predefined diffusion process $q(\bm{x_t}|\bm{x_{t-1}})$, the inverse of the diffusion process $q(\bm{x_{t-1}}|\bm{x_t},\bm{x_0})$, and the trainable conditional generation process $p_\theta(\bm{x_{t-1}}|\bm{x_t},C)$. In the diffusion process $q(\bm{x_t}|\bm{x_{t-1}})$, the task target $\bm{x_0}$ is gradually noised until it degrades to $\bm{z}$ ($\bm{x_T}$). Meanwhile, the inverse of the diffusion process $q(\bm{x_{t-1}}|\bm{x_t},\bm{x_0})$ (the ground truth in DDPMs) can be calculated from the predefined distribution of the diffusion process. Furthermore, the generation process $p_\theta(\bm{x_{t-1}}|\bm{x_t},C)$ gradually fits the inverse of the diffusion process $q(\bm{x_{t-1}}|\bm{x_t},\bm{x_0})$ until $\bm{z}$ ($\bm{x_T}$) is restored to $\bm{x_0}$, conditioned on $C=\{\bm{c},t\}$ (for unconditional generation, $\bm{c}=\emptyset$; in the formula derivation of DDPMs, the time label $t$ is usually ignored).
  • Figure 2: In (a), we try several combinations for conditional DDPMs built on the baseline (Sec. 5.1) on ScanNet. GD+CD (NCF) indicates that NN and CN are modeled as Gaussian [ho2020denoising] and categorical [austin2021structured] diffusion (details in the supplementary material). To better fit noise, the combinations dominated by NN (③, ④, ⑤, ⑥) require more iterations to converge than CNF, yet exhibit poorer performance due to the complex scene distribution. (b) shows the inference time cost of CNF and NCF under the same baseline. CNF achieves better performance with fewer iterations.
  • Figure 2: Applying DDPMs to the point cloud semantic segmentation task: the semantic label $\bm{x_0}$ is gradually perturbed with noise during the diffusion process and slowly reconstructed during the generation process conditioned on the segmented point cloud $\bm{c}$.
  • Figure 3: The overall framework of CDSegNet. The auxiliary Noise Network (NN), which acts as a noise-feature generator and models the diffusion process conditioned on the time label, perturbs the input features at different noise levels. Meanwhile, the Feature Fusion Module (FFM) controls the direction of the noise information flow, achieving semantic feature augmentation by reasonably filtering the perturbations. Furthermore, the dominant Conditional Network (CN) predicts the segmentation results in a pure manner.
  • ...and 15 more figures
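
The diffusion process $q(\bm{x_t}|\bm{x_{t-1}})$ and its inverse $q(\bm{x_{t-1}}|\bm{x_t},\bm{x_0})$ named in the captions above can be illustrated with the standard Gaussian DDPM closed forms. The block below is a minimal sketch under stated assumptions, not the CDSegNet implementation: the linear noise schedule, the one-hot encoding of semantic labels, and the function names `make_schedule`, `q_sample`, and `q_posterior` are all hypothetical choices made for illustration.

```python
import numpy as np


def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of alphas up to step t
    return betas, alphas, alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps


def q_posterior(x0, xt, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a 0-indexed step t >= 1."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1]
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = betas[t] * (1.0 - abar_prev) / (1.0 - abar_t)
    return mean, var


# Toy usage: five points, three classes, labels one-hot encoded (an assumption).
rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 1, 0])
x0 = np.eye(3)[labels]

betas, alphas, alpha_bars = make_schedule()
xt, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
post_mean, post_var = q_posterior(x0, xt, 500, betas, alphas, alpha_bars)
```

In an NCF-style model, a network would be trained to match this posterior step by step over many iterations. The sketch only shows the fixed, predefined side of the process, which is the part both frameworks share.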