VisTabNet: Adapting Vision Transformers for Tabular Data

Witold Wydmański; Ulvi Movsum-zada; Jacek Tabor; Marek Śmieja

TL;DR

VisTabNet tackles the challenge of leveraging large pre-trained vision models for tabular data by introducing a cross-modal transfer mechanism that maps tabular inputs into the ViT patch-embedding space via an adaptation network. The ViT encoder is kept largely fixed, with a lightweight tabular head, enabling effective learning from datasets with fewer than 1,000 samples. Across diverse small-tabular benchmarks, VisTabNet consistently outperforms traditional ensembles and several deep baselines, demonstrating the value of transferring middle-layer representations from image models to tabular domains. The approach also shows favorable few-shot behavior and practical accessibility through open-source tooling and a scikit-learn-like interface, broadening the applicability of transfer learning in tabular tasks.

Abstract

Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements for tabular data, which is still the most common data type in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet, a cross-modal transfer learning method that adapts a Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings that ViT accepts, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (fewer than 1k samples) demonstrate VisTabNet's superiority: it outperforms both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning. We share our example implementation as a GitHub repository available at https://github.com/wwydmanski/VisTabNet.
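The data flow described in the abstract (tabular input, projection to patch-like embeddings, Transformer encoder, classification head) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The class names `TabularToPatches` and `TabularViTSketch`, the single linear projection, the token count, the embedding size, and the mean pooling are all illustrative assumptions. The encoder here is a small, randomly initialized `nn.TransformerEncoder` standing in for a pre-trained ViT encoder, and it is frozen to mimic keeping the backbone fixed.

```python
import torch
import torch.nn as nn


class TabularToPatches(nn.Module):
    """Hypothetical adaptation layer: one feature vector -> a sequence of tokens."""

    def __init__(self, n_features: int, n_tokens: int, embed_dim: int):
        super().__init__()
        self.n_tokens = n_tokens
        self.embed_dim = embed_dim
        # Assumption: a single linear map; the real adapter may be deeper.
        self.proj = nn.Linear(n_features, n_tokens * embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, n_features) -> (batch, n_tokens, embed_dim)
        return self.proj(x).view(-1, self.n_tokens, self.embed_dim)


class TabularViTSketch(nn.Module):
    """Adapter -> frozen Transformer encoder -> MLP head."""

    def __init__(self, n_features: int, n_classes: int,
                 n_tokens: int = 8, embed_dim: int = 64):
        super().__init__()
        self.adapter = TabularToPatches(n_features, n_tokens, embed_dim)
        # Stand-in for a pre-trained ViT encoder (randomly initialized here).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Freeze the backbone so only the adapter and the head are trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.adapter(x)        # (batch, n_tokens, embed_dim)
        encoded = self.encoder(tokens)  # same shape as tokens
        pooled = encoded.mean(dim=1)    # assumption: mean pooling over tokens
        return self.head(pooled)        # (batch, n_classes) logits


# Usage: 5 rows with 16 features each, 3 target classes.
model = TabularViTSketch(n_features=16, n_classes=3)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(5, 16))
```

In a real setting the stand-in encoder would be replaced by the encoder blocks of a pre-trained ViT, with `embed_dim` matching that model's hidden size. Only the adapter and the head would then receive gradient updates.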

Paper Structure (32 sections, 2 equations, 7 figures, 4 tables)

Introduction
Related Work
VisTabNet model
Vision Transformer architecture
Transferability
Cross-modal transfer of ViT
Experiments
Tabular Data Classification
Datasets and baselines
Experimental Setup
Results
Backbone selection
Fine-tuning Progress
Transformer architectures
Few-shot transfer learning
...and 17 more sections

Figures (7)

Figure 1: Data flow architecture in VisTabNet. The tabular input is transformed into the image embedding space via our adaptation layer. After processing with the pre-trained Transformer, the data is classified using an MLP head.
Figure 2: Comparison of average ranking with standard deviation as whiskers (the lower the better).
Figure 3: Comparison of the MCC score distributions (the higher the better).
Figure 4: Progress of the Matthews correlation coefficient (MCC) over epochs, including fine-tuning epochs, for six datasets: ZOO, Dermatology, Credit Approval, Libras, and Volkert.
Figure 5: Average MCC across 5 datasets depends on the selection of both projection depth and number of classification layers.
...and 2 more figures
