BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield Model

Chenwei Xu, Yu-Chao Huang, Jerry Yao-Chieh Hu, Weijian Li, Ammar Gilani, Hsi-Sheng Goan, Han Liu

TL;DR

BiSHop is a novel end-to-end framework for deep tabular learning. It surpasses current SOTA methods while requiring significantly fewer hyperparameter optimization (HPO) runs, making it a robust solution for deep tabular learning.

Abstract

We introduce the Bi-Directional Sparse Hopfield Network (BiSHop), a novel end-to-end framework for deep tabular learning. BiSHop addresses two major challenges of deep tabular learning: the non-rotationally invariant structure of tabular data and its feature sparsity. Our key motivation comes from the recently established connection between associative memory and attention mechanisms. Accordingly, BiSHop uses a dual-component approach, sequentially processing data both column-wise and row-wise through two interconnected directional learning modules. Computationally, these modules consist of stacked generalized sparse modern Hopfield layers, a sparse extension of the modern Hopfield model with adaptable sparsity. Methodologically, BiSHop facilitates multi-scale representation learning, capturing both intra-feature and inter-feature interactions, with adaptive sparsity at each scale. Empirically, through experiments on diverse real-world datasets, we demonstrate that BiSHop surpasses current SOTA methods with significantly fewer hyperparameter optimization (HPO) runs, making it a robust solution for deep tabular learning.

Paper Structure

This paper contains 72 sections, 5 theorems, 15 equations, 4 figures, and 22 tables.

Key Result

Lemma 2.1

With $t$ denoting the iteration number, the generalized sparse modern Hopfield model admits a retrieval dynamics under which the energy function (eqn:GSH_energy) decreases monotonically in $t$.
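
The behavior described in Lemma 2.1 can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the special case where the sparse map is sparsemax (entmax with $\alpha = 2$) and the energy is $E(\mathbf{x}) = \tfrac{1}{2}\langle \mathbf{x}, \mathbf{x}\rangle - \tfrac{1}{\beta}\Psi^\star(\beta\,\Xi^\top \mathbf{x})$, where $\Psi^\star$ is the convex conjugate associated with sparsemax. The function names, the stored patterns, and the value of $\beta$ are all hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (entmax, alpha = 2)."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:      # z_k is still inside the support
            tau = (cumsum - 1.0) / k    # threshold for the largest valid k
    return [max(z_i - tau, 0.0) for z_i in z]

def scores(memories, x, beta):
    # Scaled overlap between the query x and each stored pattern.
    return [beta * dot(m, x) for m in memories]

def retrieve(memories, x, beta):
    """One retrieval step: sparse convex combination of the stored patterns."""
    p = sparsemax(scores(memories, x, beta))
    dim = len(x)
    return [sum(p[mu] * memories[mu][i] for mu in range(len(memories)))
            for i in range(dim)]

def energy(memories, x, beta):
    """Assumed energy: 0.5 * <x, x> - (1 / beta) * conjugate(beta * Xi^T x)."""
    z = scores(memories, x, beta)
    p = sparsemax(z)
    conjugate = dot(p, z) - 0.5 * dot(p, p)
    return 0.5 * dot(x, x) - conjugate / beta

def run_retrieval(memories, query, beta, steps):
    """Iterate the retrieval step and record the energy after each iteration."""
    x, energies = list(query), [energy(memories, query, beta)]
    for _ in range(steps):
        x = retrieve(memories, x, beta)
        energies.append(energy(memories, x, beta))
    return x, energies

# Hypothetical toy memory: three orthogonal patterns and a noisy query.
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_final, energies = run_retrieval(memories, [0.9, 0.2, -0.1], beta=4.0, steps=5)
print(x_final, energies)
```

Under these assumptions the update is a concave-convex procedure step for the stated energy, so the recorded energies never increase, and the noisy query settles on the closest stored pattern. Sparsemax assigns exactly zero weight to the weakly matching patterns, which is the sparsity that the model's name refers to.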

Figures (4)

  • Figure 1: High-Level Visualization of BiSHop's Pipeline.
  • Figure 2: BiSHop. (a) Tabular Embedding: For a given input feature $\mathbf{x}=(\mathbf{x}^{\text{cat}},\mathbf{x}^{\text{num}}) \in\mathbb{R}^{N}$ with $N = N^{\text{cat}} + N^{\text{num}}$, the tabular embedding produces embeddings denoted as $\mathbf{E}^{\text{emb}}(\mathbf{x})\in\mathbb{R}^{N\times G}$. (b) Patch Embedding: Using the combined numerical and categorical embeddings $\mathbf{E}^{\text{emb}}(\mathbf{x})\in\mathbb{R}^{N \times G}$, the patch embedding aggregates embedding information, reducing the dimensionality from $G$ to $P =\lceil G/L \rceil$ for all $N$ features using a stride length of $L$. (c) BiSHopModule: The Bi-Directional Sparse Hopfield Module (BiSHopModule) leverages the generalized sparse modern Hopfield model. It integrates the tabular structure's inductive bias (C1) by deploying interconnected row-wise and column-wise $\mathtt{GSH}$ layers. (d) Hierarchical Cellular Learning Module: Using a stacked encoder-decoder structure, we facilitate hierarchical cellular learning, where both the encoder and the decoder consist of BiSHopModules across $H$ layers. This arrangement enables BiSHop to derive refined representations from both directions across multiple scales. These representations are then concatenated for downstream inference, ensuring holistic bi-directional cellular learning specifically tailored for tabular data.
  • Figure 3: Changing Feature Sparsity. Following grinsztajn2022tree, we remove features in three ways: randomly (red), in increasing order of feature importance (purple), and in decreasing order of feature importance (blue), with feature importance determined by random forest. We report the average AUC score across all datasets for BiSHop, XGBoost, and LightGBM. The results highlight BiSHop's capability in handling sparse features.
  • Figure 4: Convergence Analysis. We plot the validation loss and AUC score curves of the generalized sparse Hopfield model ($\mathtt{GSH}$) and the dense Hopfield model ($\mathtt{Hopfield}$). The results, as shown by the solid lines for $\mathtt{GSH}$, indicate that the sparse Hopfield model converges faster and yields superior accuracy compared to the dense Hopfield model.
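
The dimensionality reduction in Figure 2(b), from $G$ to $P = \lceil G/L \rceil$, can be checked with a short sketch. Only the patch count follows from the caption. The zero padding and the mean pooling over each patch are hypothetical stand-ins for whatever aggregation the paper actually uses, and the function names are illustrative.

```python
import math

def num_patches(G, L):
    """Number of patches P = ceil(G / L) for embedding size G and stride L."""
    return math.ceil(G / L)

def patch_embed(embedding, L):
    """Split one feature's length-G embedding into P patches of length L.

    Assumption: the last patch is zero-padded, and each patch is reduced to
    its mean. The mean is a placeholder, not the paper's aggregation.
    """
    P = num_patches(len(embedding), L)
    padded = list(embedding) + [0.0] * (P * L - len(embedding))
    return [sum(padded[i * L:(i + 1) * L]) / L for i in range(P)]

def patch_embed_table(embeddings, L):
    """Apply the patching to all N features: an N x G input becomes N x P."""
    return [patch_embed(row, L) for row in embeddings]

# Hypothetical sizes: N = 2 features, G = 10, stride L = 4, so P = 3.
table = [[float(v) for v in range(1, 11)], [0.0] * 10]
patched = patch_embed_table(table, L=4)
print(len(patched), len(patched[0]))
```

With $G = 10$ and $L = 4$, each of the $N$ features keeps $P = 3$ values, so the downstream row-wise and column-wise layers operate on an $N \times P$ array rather than $N \times G$.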

Theorems & Definitions (10)

  • Definition 2.1: peters2019sparse
  • Lemma 2.1: Retrieval Dynamics, Lemma 3.2 of wu2023stanhop
  • Lemma 2.2: Convergence of Retrieval Dynamics $\mathcal{T}$, Lemma 3.3 of wu2023stanhop
  • Definition 3.1: Stored and Retrieved
  • Definition 3.2: Pattern Separation
  • Theorem 3.1: Retrieval Error, Theorem 3.1 of wu2023stanhop
  • Corollary 3.1.1: Faster Convergence
  • Corollary 3.1.2: Noise-Robustness
  • Remark 3.1
  • Remark 3.2