Transformers with Stochastic Competition for Tabular Data Modelling
Andreas Voskou, Charalambos Christoforou, Sotirios Chatzis
TL;DR
This paper tackles the challenge of applying deep learning to tabular data by introducing a Transformer-based model augmented with stochastic competition mechanisms. It adds Local Winner Takes All (LWTA) units, an Embedding Mixture Layer for numerical features, and a Hybrid Transformer module to capture both dynamic and static feature interactions, all trained with a Bayesian averaging scheme. Empirical results on eight public tabular benchmarks show state-of-the-art performance on multiple datasets and robust gains in ablations, validating the method’s effectiveness in GBDT-dominated domains. The work demonstrates the viability of stochastic competition in tabular deep learning and offers a scalable framework with practical inference advantages via single-model Bayesian averaging.
Abstract
Despite the prevalence and significance of tabular data across numerous industries and fields, it has been relatively underexplored in the realm of deep learning. Even today, neural networks are often overshadowed by techniques such as gradient boosted decision trees (GBDT). However, recent models are beginning to close this gap, outperforming GBDT in various setups and garnering increased attention in the field. Inspired by this development, we introduce a novel stochastic deep learning model specifically designed for tabular data. The foundation of this model is a Transformer-based architecture, carefully adapted to cater to the unique properties of tabular data through strategic architectural modifications and leveraging two forms of stochastic competition. First, we employ stochastic "Local Winner Takes All" units to promote generalization capacity through stochasticity and sparsity. Second, we introduce a novel embedding layer that selects among alternative linear embedding layers through a mechanism of stochastic competition. The effectiveness of the model is validated on a variety of widely-used, publicly available datasets. We demonstrate that, through the incorporation of these elements, our model yields high performance and marks a significant advancement in the application of deep learning to tabular data.
