(GG) MoE vs. MLP on Tabular Data

Andrei Chernov

TL;DR

This work centers on tabular data, where deep neural networks often underperform compared with gradient-boosted trees and efficient MLP ensembles. It introduces GG MoE, a mixture-of-experts model with a Gumbel-Softmax gating mechanism, and shows that GG MoE with an embedding layer achieves the highest average performance across $38$ datasets while using far fewer parameters than MLP baselines. The study emphasizes parameter efficiency and inference practicality, demonstrating that Monte Carlo gating incurs minimal overhead and that a small number of samples ($\approx 10$) suffices for accurate MC estimates. The findings suggest MoE-based architectures, particularly GG MoE with embeddings, are promising candidates for scalable, ensemble-friendly tabular learning, warranting further exploration of hierarchical and sparse MoE variants.

Abstract

In recent years, significant efforts have been directed toward adapting modern neural network architectures for tabular data. However, despite their larger number of parameters and longer training and inference times, these models often fail to consistently outperform vanilla multilayer perceptron (MLP) neural networks. Moreover, MLP-based ensembles have recently demonstrated superior performance and efficiency compared to advanced deep learning methods. Therefore, rather than focusing on building deeper and more complex deep learning models, we propose investigating whether MLP neural networks can be replaced with more efficient architectures without sacrificing performance. In this paper, we first introduce GG MoE, a mixture-of-experts (MoE) model with a Gumbel-Softmax gating function. We then demonstrate that GG MoE with an embedding layer achieves the highest performance across $38$ datasets compared to standard MoE and MLP models. Finally, we show that both MoE and GG MoE utilize significantly fewer parameters than MLPs, making them a promising alternative for scaling and ensemble methods.
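To make the gating idea concrete, the following is a minimal NumPy sketch of a mixture-of-experts forward pass with a Gumbel-Softmax gate, averaged over a small number of Monte Carlo samples. It is an illustration under stated assumptions, not the paper's implementation: the class name `GumbelGatedMoE`, the single-hidden-layer experts, the linear gate, the temperature `tau`, the random weight initialization, and the choice to average outputs over samples are all hypothetical. No embedding layer or training loop is included.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample per row of `logits`."""
    # Gumbel(0, 1) noise via inverse transform sampling; the lower bound
    # keeps log() away from zero.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class GumbelGatedMoE:
    """Hypothetical dense MoE with a Gumbel-Softmax gate (illustration only)."""

    def __init__(self, n_features, n_experts, hidden, n_out, tau=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.tau = tau
        # Assumed shapes: a linear gate and one-hidden-layer experts.
        self.W_gate = self.rng.normal(0.0, 0.1, (n_features, n_experts))
        self.W1 = self.rng.normal(0.0, 0.1, (n_experts, n_features, hidden))
        self.W2 = self.rng.normal(0.0, 0.1, (n_experts, hidden, n_out))

    def experts(self, x):
        # Returns an array of shape (batch, experts, out).
        h = np.maximum(np.einsum("bf,efh->beh", x, self.W1), 0.0)  # ReLU
        return np.einsum("beh,eho->beo", h, self.W2)

    def forward(self, x, n_samples=10):
        logits = x @ self.W_gate
        # Expert outputs do not depend on the gate noise, so they are
        # computed once and reused for every Monte Carlo sample.
        expert_out = self.experts(x)
        out = np.zeros((x.shape[0], expert_out.shape[-1]))
        for _ in range(n_samples):
            w = gumbel_softmax(logits, self.tau, self.rng)
            out += np.einsum("be,beo->bo", w, expert_out)
        return out / n_samples


model = GumbelGatedMoE(n_features=5, n_experts=3, hidden=8, n_out=2)
x = np.random.default_rng(1).normal(size=(6, 5))
y = model.forward(x, n_samples=10)
```

In this sketch only the gate is resampled per Monte Carlo draw, which is one plausible reason extra samples would add little inference cost; whether the paper's models are structured this way is an assumption.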

Paper Structure

This paper contains 25 sections, 7 equations, 1 figure, 9 tables, 1 algorithm.

Figures (1)

  • Figure 1: The average rank for each model across $38$ datasets. For each dataset, ranks were computed independently using the ranking algorithm (Algorithm 1).