Mixture of In-Context Prompters for Tabular PFNs

Derek Xu; Olcay Cirit; Reza Asadi; Yizhou Sun; Wei Wang

TL;DR

MixturePFN targets the core scalability bottleneck of PFN-based in-context learning (ICL) for tabular data. Its first component, MICP, is a sparse routing mechanism that assigns each test sample to one of several specialized prompters with small, fixed-size prompts, reducing inference cost from $O(N_{train}^2)$ memory and time to $O(1)$ memory and $O(\log N_{train})$ time. Its second component, CaPFN, keeps the PFN weights frozen and finetunes adapters on bootstrapped prompts, aligning the model with the inference-time data distribution without full fine-tuning. On the TabZilla benchmark, MixturePFN achieves state-of-the-art results across 36 datasets against 19 baselines, is the Condorcet winner with statistically significant gains, and maintains its performance across dataset sizes and irregularities. Together, routing and bootstrapping give a scalable tabular ICL framework that balances efficiency and accuracy on large datasets.
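The routing idea summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the router sends a test row to its nearest K-Means centroid and that each prompter's context is the `b` training rows closest to that centroid. The function names (`fit_centroids`, `build_prompters`, `micp_prompt`) and the toy data are hypothetical, and the brute-force nearest-centroid search here is linear in the number of centroids rather than logarithmic.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point` (brute force)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def fit_centroids(rows, k, n_iter=10, seed=0):
    """Plain Lloyd's K-Means over feature rows; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for r in rows:
            groups[nearest(r, centroids)].append(r)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def build_prompters(rows, labels, centroids, b):
    """One fixed-size context per centroid: its b nearest training rows."""
    prompters = []
    for c in centroids:
        idx = sorted(range(len(rows)), key=lambda i: sq_dist(rows[i], c))[:b]
        prompters.append([(rows[i], labels[i]) for i in idx])
    return prompters

def micp_prompt(x_test, centroids, prompters):
    """Route a test row to one prompter and return (prompter index, context)."""
    k = nearest(x_test, centroids)
    return k, prompters[k]

# Toy data (made up for illustration): three Gaussian blobs, 50 rows each.
rng = random.Random(1)
rows = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]
labels = [i // 50 for i in range(150)]

centroids = fit_centroids(rows, k=3)
prompters = build_prompters(rows, labels, centroids, b=16)
k, context = micp_prompt([4.8, 5.1], centroids, prompters)
print(k, len(context))
```

Whatever the training set size, the context handed to the downstream model has exactly `b` rows, which is what keeps the per-query prompt size constant in this sketch.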

Abstract

Recent benchmarks found In-Context Learning (ICL) outperforms both deep learning and tree-based algorithms on small tabular datasets. However, on larger datasets, ICL for tabular learning cannot run without severely compromising performance, due to its quadratic space and time complexity w.r.t. dataset size. We propose MIXTUREPFN, which both extends nearest-neighbor sampling to the state-of-the-art ICL for tabular learning model and uses bootstrapping to finetune said model on the inference-time dataset. MIXTUREPFN is the Condorcet winner across 36 diverse tabular datasets against 19 strong deep learning and tree-based baselines, achieving the highest mean rank among Top-10 aforementioned algorithms with statistical significance.

Paper Structure (52 sections, 1 theorem, 2 equations, 16 figures, 14 tables)

Introduction
Preliminaries
Prior Fitted Networks
Pretraining
Inference
Fundamental Scalability Limitations
PFN-Style Batching
Method
Support Set Approximation for Scalable PFNs
Mixture of In-Context Prompters (MICP)
Router and Prompter Initialization
Efficiency and Effectiveness Trade-Off
Context-Aware Finetuning (CaPFN)
Bootstrapping Large MICP Datasets
Bootstrapping Small Datasets
...and 37 more sections

Key Result

Theorem 1

If every K-Means cluster contains at most $B$ samples, $|D_{cluster}^{(k)}| \leq B \;\; \forall k\in[0,...,K-1]$, and training points route to their assigned K-Means cluster, $\mathcal{R}^*(x_{train}^{(i)}) = k : x_{train}^{(i)} \in D_{cluster}^{(k)}$ (these conditions can be satisfied via constrained K-Means)
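The two premises stated above are easy to check on a concrete clustering. The sketch below is only an illustration under assumed inputs: `assignments[i]` is the K-Means cluster of training point `i`, `routed[i]` is the cluster the router sends that same point to, and `b` is the size cap $B$. The function name `check_conditions` is hypothetical, and the code checks the premises only, not the theorem's conclusion.

```python
from collections import Counter

def check_conditions(assignments, routed, b):
    """Return (size_ok, route_ok) for the two premises.

    size_ok:  every cluster holds at most b training points.
    route_ok: every training point is routed to its own K-Means cluster.
    """
    sizes = Counter(assignments)
    size_ok = all(n <= b for n in sizes.values())
    route_ok = all(a == r for a, r in zip(assignments, routed))
    return size_ok, route_ok

# Made-up example: three clusters of sizes 2, 3, and 1.
assignments = [0, 0, 1, 1, 1, 2]
routed = [0, 0, 1, 1, 1, 2]
print(check_conditions(assignments, routed, b=3))  # (True, True)
print(check_conditions(assignments, routed, b=2))  # (False, True)
```

If the size check fails for ordinary K-Means, a size-constrained variant is the remedy the source points to.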

Figures (16)

Figure 1: We highlight the differences between In-Context Learning (ICL) on Prior Fitted Networks (e.g., TabPFN), left, and Large Language Models (LLMs), right. TabPFN treats training data as tokens (where each token is a concatenation of feature and label), whereas LLMs use templates to convert training data into natural language prompts. TabPFN uses an attention pattern (blue and red arrows) supporting batch inference, whereas LLMs use generic encoder-decoder or decoder-only setups. TabPFN is pretrained on the PFN loss equation, whereas LLMs are pretrained on a separate objective.
Figure 2: Illustration of MixturePFN. MICP (Left): New test samples are passed to a router that picks 1 out of $K$ prompters to form a scalable "prompt" with $B$ training samples for the downstream PFN model. CaPFN (Right): TabPFN is frozen, fitted with adapters, then finetuned using the data prior negative log likelihood (the PFN loss equation) on our bootstrapped data prior, $p(D|D_{train})$. This prior simulates the MICP inference mechanism. The finetuned model is called CaPFN.
Figure 3: (a): We plot the difference in Log Likelihood between MixturePFN and TabPFN* for each dataset of size $N_{train}$. MixturePFN substantially improves the performance of TabPFN* and runs on datasets with $>3,000$ samples. (b): We plot the Log Likelihood of the top deep learning (DL), PFN, and tree baselines across all 36 datasets and the best-fit line between rank and dataset size, compared to the top baseline. Unlike TabPFN, MixturePFN maintains its good performance as the size of the dataset increases. (c): We plot the best among the top DL, PFN, and tree baselines on all 36 datasets across different dataset properties. MixturePFN performs well across different dataset irregularities. We provide further breakdowns in the Appendix.
Figure 4: The Wilcoxon signed-rank test shows MixturePFN significantly outperforms the Top-10 baselines on the 22 shared datasets. To break ties, we rank algorithms based on their mean log-likelihoods, following TabZilla (McElfresh et al., 2023). We compute the rank across all 10 cross-validation splits. We report additional critical difference diagrams in the Appendix.
Figure 5: Pairwise comparison matrix for Condorcet voting over the log likelihood metric. Note, MixturePFN is the Condorcet winner.
...and 11 more figures
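For readers unfamiliar with the Condorcet criterion used in Figure 5, the sketch below shows how a winner can be read off a pairwise comparison matrix. It assumes `wins[i][j]` counts the datasets on which algorithm `i` beats algorithm `j`. The function name, the algorithm names, and the counts are hypothetical placeholders, not results from the paper.

```python
def condorcet_winner(names, wins):
    """Return the algorithm that beats every other one head-to-head, or None.

    wins[i][j] is the number of datasets where names[i] beats names[j].
    """
    n = len(names)
    for i in range(n):
        if all(wins[i][j] > wins[j][i] for j in range(n) if j != i):
            return names[i]
    return None  # pairwise preferences form a cycle or contain ties

# Made-up pairwise counts for three placeholder algorithms.
names = ["A", "B", "C"]
wins = [
    [0, 20, 25],
    [16, 0, 19],
    [11, 17, 0],
]
print(condorcet_winner(names, wins))  # A

# A rock-paper-scissors cycle has no Condorcet winner.
cycle = [
    [0, 2, 1],
    [1, 0, 2],
    [2, 1, 0],
]
print(condorcet_winner(names, cycle))  # None
```

A Condorcet winner need not exist, so reporting one is a stronger statement than having the best mean rank.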

Theorems & Definitions (1)

Theorem 1: Nonzero Overlap
