Unlocking the Transferability of Tokens in Deep Models for Tabular Data

Qi-Le Zhou, Han-Jia Ye, Le-Ye Wang, De-Chuan Zhan

TL;DR

The paper tackles transfer learning for deep tabular models when the upstream and downstream feature spaces differ. It introduces TabToken, a token-centric approach that enriches feature tokens (the embeddings of tabular features) with semantics through a Contrastive Token Regularization (CTR) loss during pre-training. The regularizer enters the training objective as $ \min_{f_0=g_0\circ h_0} \sum_{i=1}^N \ell(g_0(h_0(\boldsymbol{x}_{i,:})), y_i) + \beta \Omega(\{\boldsymbol{T}_i\})$, where $h_0$ is the feature tokenizer, $g_0$ is the top-layer model, and $\beta$ weights the token regularizer $\Omega$. During fine-tuning, tokens of overlapping features are kept fixed, while tokens of unseen features are initialized by averaging and regularized to preserve semantic structure, which enables transfer with limited data. Experiments on 10 tabular datasets show strong transfer performance across differing feature sets and improved discriminative power on standard classification and regression tasks.
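To make the structure of the objective concrete, here is a minimal sketch in plain Python. It assumes one particular form for $\Omega$ that this summary does not spell out: each instance token is the average of that instance's feature tokens, and the penalty is the mean squared distance between an instance token and the center of its class. The function names (`mean_vector`, `ctr_penalty`, `regularized_objective`) are hypothetical and chosen only for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ctr_penalty(instance_tokens, labels):
    """Assumed form of Omega: mean squared distance to the class center."""
    by_class = {}
    for tok, y in zip(instance_tokens, labels):
        by_class.setdefault(y, []).append(tok)
    centers = {y: mean_vector(toks) for y, toks in by_class.items()}
    total = 0.0
    for tok, y in zip(instance_tokens, labels):
        total += sum((a - b) ** 2 for a, b in zip(tok, centers[y]))
    return total / len(instance_tokens)


def regularized_objective(task_losses, feature_tokens, labels, beta):
    """Sum of per-instance task losses plus beta times the token penalty."""
    # feature_tokens[i] holds the per-feature tokens of instance i.
    instance_tokens = [mean_vector(toks) for toks in feature_tokens]
    return sum(task_losses) + beta * ctr_penalty(instance_tokens, labels)


# Toy data: three instances, two features each, 2-dimensional tokens.
feature_tokens = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[2.0, 2.0], [4.0, 4.0]],
    [[10.0, 10.0], [10.0, 10.0]],
]
labels = [0, 0, 1]
task_losses = [0.5, 0.5, 1.0]
objective = regularized_objective(task_losses, feature_tokens, labels, beta=0.5)
```

In a real model the task losses and tokens would come from a differentiable network and the whole expression would be minimized by gradient descent; the sketch only shows how the two terms are combined.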

Abstract

Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks. However, such a paradigm becomes particularly challenging with tabular data when there are discrepancies between the feature sets of pre-trained models and the target tasks. In this paper, we propose TabToken, a method that aims to enhance the quality of feature tokens (i.e., embeddings of tabular features). TabToken allows for the utilization of pre-trained models when the upstream and downstream tasks share overlapping features, facilitating model fine-tuning even with limited training examples. Specifically, we introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features. During the pre-training stage, the tokens are learned jointly with top-layer deep models such as a transformer. In the downstream task, tokens of the shared features are kept fixed while TabToken efficiently fine-tunes the remaining parts of the model. TabToken not only enables knowledge transfer from a pre-trained model to tasks with heterogeneous features, but also enhances the discriminative ability of deep tabular models in standard classification and regression tasks.
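The fine-tuning step described above (reuse and freeze the tokens of shared features, train the rest) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: tokens are stored as plain lists keyed by feature name, tokens of unseen features are initialized to the average of the pre-trained tokens of the shared features, and freezing is modeled by a boolean flag that a hypothetical `sgd_step` helper respects. Any regularization applied during fine-tuning is omitted.

```python
def init_downstream_tokens(pretrained, downstream_features):
    """Build downstream tokens and a per-feature 'trainable' flag."""
    shared = [f for f in downstream_features if f in pretrained]
    if not shared:
        raise ValueError("no overlapping features to transfer")
    dim = len(pretrained[shared[0]])
    # Assumption: unseen tokens start at the mean of the shared tokens.
    avg = [sum(pretrained[f][j] for f in shared) / len(shared)
           for j in range(dim)]
    tokens, trainable = {}, {}
    for f in downstream_features:
        if f in pretrained:
            tokens[f] = list(pretrained[f])  # reused as-is and frozen
            trainable[f] = False
        else:
            tokens[f] = list(avg)            # unseen feature, trainable
            trainable[f] = True
    return tokens, trainable


def sgd_step(tokens, trainable, grads, lr):
    """Apply one gradient step, skipping frozen tokens."""
    for f, g in grads.items():
        if trainable[f]:
            tokens[f] = [t - lr * gi for t, gi in zip(tokens[f], g)]


# Toy pre-trained tokens; feature names are made up for the example.
pretrained = {"age": [1.0, 3.0], "job": [3.0, 5.0], "income": [9.0, 9.0]}
tokens, trainable = init_downstream_tokens(
    pretrained, ["age", "job", "balance"])
sgd_step(tokens, trainable,
         {"age": [1.0, 1.0], "balance": [1.0, 1.0]}, lr=0.5)
```

In a deep learning framework the same effect is usually obtained by disabling gradients on the shared embedding parameters rather than filtering updates by hand.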

Paper Structure

This paper contains 31 sections, 15 equations, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: The toy example involves predicting ripeness based on the color of the watermelon, where ripe ones have a deep color and unripe ones are pale. Semantic distributions are more discriminative and possess the potential for transferability.
  • Figure 2: Illustrations of the token-based model, the transfer task, and the procedure of TabToken. (a) A token-based model $f_0$ for tabular data can be decomposed into a feature tokenizer $h_0$ and a top-layer model $g_0$. (b) In the transfer task, the downstream dataset shares $s$ features with the pre-training dataset and introduces $d_t-s$ unseen features. When the feature space changes, we aim to transfer the pre-trained model to the downstream task. (c) In the pre-training stage, by employing token combination and regularization, TabToken incorporates the semantics of labels into tokens. In the fine-tuning stage, TabToken freezes the overlapping feature tokens of the pre-trained tokenizer and efficiently fine-tunes the other modules.
  • Figure 3: Averaging outperforms concatenating as it aligns distinct feature tokens, which enhances the semantics.
  • Figure 4: Feature tokens on synthetic datasets. Colors indicate which feature the tokens belong to, and the same shapes indicate semantically similar tokens. Colorless circles represent tokens of random features. (a): Categories with similar semantics across different features are not captured in the tokens. Tokens from random features may come close to other tokens that are relevant to the target, thereby influencing the prediction. (b): Tokens with similar semantics exhibit a clear clustering phenomenon, while tokens representing random features are tightly clustered together in the center.
  • Figure 5: Feature tokens trained with CTR on the bank-marketing dataset. The tokens of the feature "job" are distributed according to job type. The hierarchical pattern relates to the probability of purchasing financial products. The distributions of tokens for the features "education" and "marital" align perfectly with their respective semantic orders.
  • ...and 4 more figures
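
The decomposition in Figure 2(a) and the two token-combination options named in Figure 3 can be illustrated with a small sketch. It assumes a common tokenizer design that the captions do not specify: a numerical feature is tokenized by scaling a learned vector with the feature value, and a categorical feature by looking up a learned vector per category. All names and values below are hypothetical. The sketch shows only one structural difference between the two combinations (averaging keeps the token dimension, while concatenation grows with the number of features); it does not reproduce the alignment effect reported in Figure 3.

```python
def tokenize(row, num_embed, cat_embed):
    """Sketch of h_0: map one row to a list of equal-length feature tokens."""
    tokens = []
    for name, value in row.items():
        if name in num_embed:
            # Numerical feature: scale the learned vector by the value.
            tokens.append([value * e for e in num_embed[name]])
        else:
            # Categorical feature: look up the vector of this category.
            tokens.append(list(cat_embed[name][value]))
    return tokens


def combine_average(tokens):
    """Average the feature tokens; output size is the token dimension."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def combine_concat(tokens):
    """Concatenate the feature tokens; output size grows with feature count."""
    return [x for tok in tokens for x in tok]


# Hypothetical learned parameters with 2-dimensional tokens.
num_embed = {"age": [0.5, -1.0]}
cat_embed = {"job": {"admin": [1.0, 0.0], "student": [0.0, 1.0]}}

row_tokens = tokenize({"age": 2.0, "job": "student"}, num_embed, cat_embed)
averaged = combine_average(row_tokens)
concatenated = combine_concat(row_tokens)
```

Because the averaged representation has a fixed size, it stays well defined when a downstream task adds or drops features, which is the setting of Figure 2(b).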