VisTabNet: Adapting Vision Transformers for Tabular Data
Witold Wydmański, Ulvi Movsum-zada, Jacek Tabor, Marek Śmieja
TL;DR
VisTabNet tackles the challenge of leveraging large pre-trained vision models for tabular data by introducing a cross-modal transfer mechanism that maps tabular inputs into the ViT patch-embedding space via an adaptation network. The ViT encoder is kept largely fixed, with a lightweight tabular head, enabling effective learning from datasets with fewer than 1,000 samples. Across diverse small-tabular benchmarks, VisTabNet consistently outperforms traditional ensembles and several deep baselines, demonstrating the value of transferring middle-layer representations from image models to tabular domains. The approach also shows favorable few-shot behavior and practical accessibility through open-source tooling and a scikit-learn-like interface, broadening the applicability of transfer learning in tabular tasks.
Abstract
Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet -- a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet's superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning. We share our example implementation as a GitHub repository available at https://github.com/wwydmanski/VisTabNet.
