MME: Mixture of Mesh Experts with Random Walk Transformer Gating

Amir Belder, Ayellet Tal

TL;DR

A new gate architecture is proposed that encourages each expert to specialize in the classes it excels in. It is paired with a dynamic loss balancing scheme that adjusts the trade-off between diversity and similarity losses throughout training: diversity prompts expert specialization, and similarity enables knowledge sharing among the experts.

Abstract

In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialize in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism enables the gate to focus on the areas most informative for each expert's decision-making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts the trade-off between diversity and similarity losses throughout training, where diversity prompts expert specialization, and similarity enables knowledge sharing among the experts. Our framework achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: https://github.com/amirbelder/MME-Mixture-of-Mesh-Experts.
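
To make the diversity/similarity trade-off concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes the total loss is a diversity term plus a weight `lam` (standing in for $\lambda_t$) times a similarity term. The cross-entropy and pairwise squared-distance forms, as well as the names `cross_entropy`, `similarity_loss`, and `total_loss`, are hypothetical stand-ins, since the abstract does not give the exact formulas. How the weight is chosen at each iteration is left outside the sketch.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label (assumed diversity term)."""
    return -math.log(max(probs[label], 1e-12))


def similarity_loss(expert_probs):
    """Mean squared distance over all pairs of expert predictions.

    This particular form is an assumption made for illustration only.
    """
    pairs, total = 0, 0.0
    for i in range(len(expert_probs)):
        for j in range(i + 1, len(expert_probs)):
            total += sum(
                (a - b) ** 2 for a, b in zip(expert_probs[i], expert_probs[j])
            )
            pairs += 1
    return total / pairs if pairs else 0.0


def total_loss(expert_probs, chosen_probs, label, lam):
    """Diversity term on the chosen prediction plus lam times similarity."""
    return cross_entropy(chosen_probs, label) + lam * similarity_loss(expert_probs)


# Two hypothetical experts predicting over two classes; the true label is 0.
experts = [[0.7, 0.3], [0.5, 0.5]]
loss_without_sharing = total_loss(experts, experts[0], label=0, lam=0.0)
loss_with_sharing = total_loss(experts, experts[0], label=0, lam=0.5)
```

Under these assumptions, a larger `lam` penalizes disagreement between experts more heavily (more knowledge sharing), while `lam = 0` leaves only the term computed on the chosen prediction.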

Paper Structure (11 sections, 3 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Model. At each iteration ($t$), the expert environment receives a batch of meshes along with a weighting factor ($\lambda_t$) that balances the loss terms. It outputs a weight for each expert ($s_t$) and the batch's accuracy ($r_t$), which are then passed to the agent as input for the next iteration ($t+1$).
  • Figure 2: Expert environment. At each iteration $t$, all experts receive the same batch of meshes and independently produce their predictions. The same meshes are also fed into the random walk extractor, which generates walk-based representations. The extracted random walks are then passed as input to the Transformer gate, which assigns weights to the experts. The experts' predictions, along with the gate's weights, are passed to the expert chooser, which selects, for each mesh, the prediction of the expert with the highest assigned weight. The similarity loss, weighted by $\lambda_t$, is applied to the experts' predictions ($V_1, \dots, V_J$), while the diversity loss is applied to the final prediction, $V_{\text{chosen}}$. The gate's weightings form the next state ($s_t$) and the batch's accuracy forms the next reward ($r_t$).
  • Figure 3: Transformer gate. We zoom into the gate model. The gate's input is a random walk extracted from a given input mesh. The gate outputs a weight for each of the $J$ experts, which is the key to selecting the prediction of the most suitable expert for the mesh. The gate consists of an encoder and a decoder. The encoder processes the walk, and its output serves as input to the decoder. The encoder consists of a single embedding layer followed by $8$ Multi-Head Attention (MHA) layers. The decoder outputs a single weight for each expert (leading to an output of size $J$). It consists of a single embedding layer, followed by $8$ MHA layers and an FC layer whose size equals the number of experts ($J$).
  • Figure 4: Qualitative results. Body parts are clearly segmented and consistently labeled across poses.
  • Figure 5: Qualitative results of segmentation (COSEG). Our segmentations closely match the ground truth, whereas PD-MeshNet produces errors on these objects. This is because our model selected MeshCNN as the expert for these cases, rather than PD-MeshNet.
  • ...and 1 more figures
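
The data flow in Figures 2 and 3 (extract a random walk, let the gate weight the experts, then keep the prediction of the highest-weighted expert) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `random_walk` and `choose_expert`, the adjacency-list mesh representation, and the policy of preferring unvisited neighbours are all hypothetical, and the gate weights are supplied by hand rather than computed by a Transformer.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk along mesh edges for `length` vertices.

    Assumed policy: prefer neighbours not yet visited and fall back to any
    neighbour otherwise. Every vertex must have at least one neighbour.
    """
    walk = [start]
    visited = {start}
    while len(walk) < length:
        neighbours = adjacency[walk[-1]]
        fresh = [v for v in neighbours if v not in visited]
        nxt = rng.choice(fresh if fresh else neighbours)
        walk.append(nxt)
        visited.add(nxt)
    return walk


def choose_expert(gate_weights, expert_predictions):
    """Return (index, prediction) of the expert with the highest gate weight."""
    best = max(range(len(gate_weights)), key=gate_weights.__getitem__)
    return best, expert_predictions[best]


# Toy mesh: a tetrahedron, where every vertex is adjacent to the other three.
tetrahedron = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
walk = random_walk(tetrahedron, start=0, length=4, rng=random.Random(0))

# Hand-written stand-ins for the gate output and the J = 3 expert predictions.
gate_weights = [0.1, 0.7, 0.2]
expert_predictions = ["chair", "table", "lamp"]
chosen_index, chosen_prediction = choose_expert(gate_weights, expert_predictions)
```

In a full pipeline the walk would be embedded and passed through the encoder and decoder to produce `gate_weights`; here only the selection step is exercised, so the second expert's prediction is kept because its weight is the largest.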