MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

Wall Kim; Chaeyoung Song; Hanul Kim

TL;DR

The Multi-Modal Prior-data Fitted Network (MMPFN) extends TabPFN to handle tabular and non-tabular modalities in a unified manner. It introduces a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning.

Abstract

Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at https://github.com/too-z/MultiModalPFN.
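
To make the role of the modality projector more concrete, the following is a minimal sketch of the general idea: several gated MLP heads expand one non-tabular embedding into multiple tokens, and a cross-attention pooler then compresses them into a small, fixed number of tokens. It is not the authors' implementation. The function names, the sigmoid gate, the single-layer heads, the random weights, and all sizes (`embed_dim`, `token_dim`, `num_heads`, `num_pooled`) are assumptions chosen for illustration, and the real MMPFN components may differ in every one of these details.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_mlp_head(embedding, w_value, w_gate):
    """One gated head (assumed form): linear value path times a sigmoid gate."""
    values = matvec(w_value, embedding)
    gates = [sigmoid(g) for g in matvec(w_gate, embedding)]
    return [v * g for v, g in zip(values, gates)]


def cross_attention_pool(queries, tokens):
    """Pool any number of tokens into len(queries) tokens.

    Each query scores every token with scaled dot-product attention and
    returns the attention-weighted average of the tokens.
    """
    dim = len(tokens[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def project_modality(embedding, heads, queries):
    """Expand one embedding into len(heads) tokens, then pool to len(queries)."""
    tokens = [gated_mlp_head(embedding, w_v, w_g) for w_v, w_g in heads]
    return cross_attention_pool(queries, tokens)


rng = random.Random(0)
embed_dim, token_dim, num_heads, num_pooled = 16, 8, 6, 2  # illustrative sizes

# Stand-in for the output of a frozen image or text encoder.
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
heads = [
    (random_matrix(token_dim, embed_dim, rng), random_matrix(token_dim, embed_dim, rng))
    for _ in range(num_heads)
]
queries = random_matrix(num_pooled, token_dim, rng)

pooled_tokens = project_modality(image_embedding, heads, queries)
print(len(pooled_tokens), len(pooled_tokens[0]))  # 2 8
```

In this sketch the number of pooled tokens is set by the number of queries, independently of how many heads generated tokens upstream. That separation is one plausible way to keep the count of non-tabular tokens comparable to the count of tabular features, which is the kind of balance the abstract refers to as mitigating attention imbalance.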

Paper Structure (38 sections, 3 equations, 7 figures, 7 tables)

Introduction
Related works
Vision–Language Multimodal Models
Tabular and Multimodal Models
General-Purpose Pre-trained Models
Proposed Method
Preliminary: TabPFN
Multimodal PFN: Architecture
Per-Modality Encoders
Modality Projector
Multimodal PFN: Training
Attention Imbalance in MMPFN
Experiments
Experimental Setup
Dataset.
...and 23 more sections

Figures (7)

Figure 1: An overview of MMPFN. MMPFN extends TabPFN by incorporating per-modality encoders and a modality projector to extract features from non-tabular data. Newly developed components are highlighted in color, while existing ones appear in gray. Layers marked as ‘frozen’ remain fixed during fine-tuning, whereas all others are trainable. Encoded target labels are part of the training inputs but are omitted from the diagram for clarity.
Figure 2: Performance on PU20 and Cloth as a function of the number of non-tabular features. (a) Results from non-tabular–only experiments using DINOv2/ELECTRA with an MLP baseline. (b) Results from multi-modal token imbalance experiments. In both settings, the y-axis denotes accuracy, while the x-axis corresponds to the number of non-tabular input features. For (b), when applying MGM+CAP, the number of CAP heads was fixed at 24 and 4, respectively.
Figure 3: Cosine similarity between multimodal feature embeddings. Axes denote all tabular and text/image features. From left to right and top to bottom, the panels show the similarities between features in the experiments on the PU20, Calc, Cloth, Mass, PetFinder, and Airbnb datasets.
Figure 4: Accuracy of AutoGluon vs. MMPFN on PetFinder under different modality combinations: tabular, +text, +image, +image+text.
Figure S1: Effect of the ratio between tabular and non-tabular features. The black dots and vertical lines show the mean and variance across five random seeds. Darker black dots correspond to a larger number of MGM heads (i.e., more non-tabular features generated by MGM), ranging from 8 to 128. The red dot indicates the average result across all MGM-head settings. The x-axis shows the number of non-tabular features generated by CAP, and the y-axis denotes accuracy. The blue line represents the number of tabular features. Dataset names are shown above each subfigure.
...and 2 more figures
