Two Heads are Better than One: Geometric-Latent Attention for Point Cloud Classification and Segmentation

Hanz Cuevas-Velasquez, Antonio Javier Gallego, Robert B. Fisher

TL;DR

An innovative two-headed attention layer that combines geometric and latent features to segment a 3D scene into semantically meaningful subsets. It achieves competitive results on the ShapeNetPart and ModelNet40 datasets and state-of-the-art results on the complex S3DIS dataset.

Abstract

We present an innovative two-headed attention layer that combines geometric and latent features to segment a 3D scene into semantically meaningful subsets. Each head combines local and global information from a neighborhood of points, using either the geometric or the latent features, and uses this information to learn better local relationships. This Geometric-Latent attention layer (Ge-Latto) is combined with a sub-sampling strategy to capture global features. Our method is invariant to permutation thanks to the use of shared-MLP layers, and it can also be used with point clouds of varying density because the local attention layer does not depend on the neighbor order. Our proposal is simple yet robust: it achieves competitive results on the ShapeNetPart and ModelNet40 datasets and state-of-the-art results on the complex S3DIS dataset, with 69.2% IoU on Area 5 and 89.7% overall accuracy using K-fold cross-validation over the 6 areas.
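To make the permutation-invariance claim concrete, here is a minimal sketch of a two-headed local attention over one neighborhood. It is not the authors' Ge-Latto implementation: the function names, the single linear scoring map standing in for a shared MLP, and the plain concatenation of the two heads are all assumptions made for illustration. The point it shows is that applying the same weights to every neighbor, then taking a softmax-weighted sum, gives an output that does not depend on neighbor order.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(neighbors, weights):
    """One head: score every neighbor with the SAME weights (a stand-in for
    a shared MLP), normalize over the neighborhood, return the weighted sum."""
    scores = [sum(w * x for w, x in zip(weights, vec)) for vec in neighbors]
    alphas = softmax(scores)
    dim = len(neighbors[0])
    return [sum(a * vec[d] for a, vec in zip(alphas, neighbors)) for d in range(dim)]

def two_headed_attention(center_xyz, neighbor_xyz, neighbor_feats, w_geo, w_lat):
    """Hypothetical two-headed layer: one head on geometric features
    (neighbor coordinates relative to the center), one on latent features.
    The heads are simply concatenated here; the paper combines them with MLPs."""
    rel = [[p[i] - center_xyz[i] for i in range(3)] for p in neighbor_xyz]
    geo = attention_head(rel, w_geo)
    lat = attention_head(neighbor_feats, w_lat)
    return geo + lat  # list concatenation: 3 geometric + C latent values

# Toy neighborhood: 4 neighbors, 3-D coordinates, 2-D latent features.
xyz = [[0.0, 0.1, 0.2], [0.3, 0.1, 0.0], [0.2, 0.2, 0.2], [0.5, 0.4, 0.1]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
out = two_headed_attention([0.2, 0.2, 0.1], xyz, feats, [0.1, 0.2, 0.3], [0.4, 0.6])
```

Reordering `xyz` and `feats` with the same permutation leaves `out` unchanged up to floating-point rounding, which is the property the abstract attributes to the shared-MLP design.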

Paper Structure

This paper contains 11 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The encoder-decoder architecture receives as input $xyz$ coordinates and $RGB$. The figure shows the effect of the sub-sample process on the $xyz$ values and the output features of each encoder layer. The encoder consists of ResNet blocks, which have our Ge-Latto layer (see Figure 3). The decoder consists of up-sampling layers which are concatenated with their respective encoder features using residual connections and combined with an MLP.
  • Figure 2: Clustering process inside ResNet blocks. The first image shows the input points of the layer. The grouping criteria of Block 1 and 2 are shown in the second and third image. Block 1 groups the input points using the sampled points as centers with a radius $r_1$, whereas Block 2 does the grouping on the sampled points with a bigger radius $r_2$.
  • Figure 3: ResNet Blocks. The first block sub-samples the point cloud and finds nearest neighbors inside a radius between the sampled points and the input points. Because of the sampling, the residual connection has a maxpooling layer to match the input with the output size. The function of the second block is similar to the first one, but without sub-sampling.
  • Figure 4: The two-headed Ge-Latto layer computes the local-attention for the geometric and latent features individually and then combines them using $f_i$ MLP layers.
  • Figure 5: S3DIS results.
  • ...and 1 more figure
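The grouping described in the Figure 2 caption can be sketched as a brute-force radius query. This is an illustration under stated assumptions, not the paper's code: `radius_group` is a hypothetical helper, the every-other-point sub-sampling is a placeholder (the sampling strategy is not specified in the captions above), and the values of `r1` and `r2` are arbitrary.

```python
import math

def radius_group(centers, points, radius):
    """For each center, return the indices of `points` within `radius`.
    Brute force, O(len(centers) * len(points)); fine for a toy example."""
    return [
        [i for i, p in enumerate(points) if math.dist(c, p) <= radius]
        for c in centers
    ]

# Toy point cloud: four points on a line.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

# Placeholder sub-sampling (assumption): keep every other point.
sampled = points[::2]

r1, r2 = 1.0, 2.0  # arbitrary radii with r2 > r1, as in the caption

# Block 1: sampled points are the centers, neighbors come from the input points.
block1_groups = radius_group(sampled, points, r1)

# Block 2: grouping is done among the sampled points, with the bigger radius.
block2_groups = radius_group(sampled, sampled, r2)
```

Block 1 indexes into the full input cloud while Block 2 indexes into the sub-sampled cloud, which matches the caption's distinction between grouping input points around sampled centers and grouping the sampled points themselves.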