
Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers

Mohammadreza Heidarianbaei, Max Mehltretter, Franz Rottensteiner

Abstract

Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometric descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.

Paper Structure

This paper contains 27 sections, 5 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Overview of the network architecture. The feature extraction branch generates one feature vector per face. The faces are clustered using K-means clustering. The tensor containing the feature vectors is reshaped so that there is one sequence of feature vectors per cluster, and a learnable cluster token is prepended to each sequence. These feature vectors are processed by a sequence of $N_{Bl}$ TSTBs. The output of the final block is used to predict the class scores and labels. Numbers in brackets denote tensor dimensions.
  • Figure 2: Architecture of a TSTB. It consists of two sub-blocks: a local block, in which the feature vectors within each face cluster can interact with each other, and a cross-cluster block, designed to capture global context by allowing the cluster tokens of different clusters to interact. Numbers in brackets denote tensor dimensions.
  • Figure 3: Qualitative results achieved by our method and by the best-performing baseline, NoMeFormer. From left to right: Input, reference, our method, and NoMeFormer. Top: an example from SUM. Bottom: an example from CH. Colour code — SUM: $T$ (blue), $V$ (dark green), $Bld$ (green), $W$ (yellow), $C$ (orange), $B$ (red). Colour code — CH: $ND$ (greenish blue), $J$ (red), $St$ (green), $D$ (chartreuse green), $LI$ (gold), $BC$ (yellow), $SE$ (orange), unlabelled (blue).
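The two-stage design described in Figures 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation: it uses single-head attention with identity projections (no learned weight matrices, residuals, or normalization), and the tensor shapes `(C, S, D)` for `C` clusters of `S` faces with `D`-dimensional features are assumed for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq, dim). Single-head scaled dot-product attention with
    # identity Q/K/V projections, for brevity only.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

def two_stage_block(tokens):
    # tokens: (C, S+1, D); the cluster token sits at position 0 of each sequence.
    # Stage 1 (local block): faces within each cluster interact.
    local = np.stack([self_attention(seq) for seq in tokens])
    # Stage 2 (cross-cluster block): only the C cluster tokens interact,
    # giving a global information flow at low cost.
    out = local.copy()
    out[:, 0, :] = self_attention(local[:, 0, :])
    return out

rng = np.random.default_rng(0)
C, S, D = 4, 8, 16                              # hypothetical sizes
face_feats = rng.normal(size=(C, S, D))         # one feature vector per face, clustered
cls_token = rng.normal(size=(1, 1, D))          # learnable cluster token (shared init)
tokens = np.concatenate([np.broadcast_to(cls_token, (C, 1, D)), face_feats], axis=1)
out = two_stage_block(tokens)
print(out.shape)  # (4, 9, 16): shapes are preserved, so blocks can be stacked
```

Because input and output shapes match, $N_{Bl}$ such blocks can be chained, alternating local and cross-cluster information exchange as in the figure.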