TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
Christopher Kolberg, Katharina Eggensperger, Nico Pfeifer
TL;DR
The paper addresses the challenge of high-dimensional, low-sample-size (HDLSS) tabular data in biomedicine by extending an existing tabular foundation model (TabPFNv2) through continued pre-training on synthetic HDLSS data. It introduces an HDLSS data generation process based on a causal prior, with feature widening to tens of thousands of features, and demonstrates that TabPFN-Wide maintains or improves predictive performance while remaining robust to noise and requiring no feature reduction. A key finding is that feature-wise attention scores correlate with predictive importance, which gives the model inherent interpretability and surfaces biologically relevant gene signals in cancer datasets. Overall, TabPFN-Wide is a scalable, interpretable approach for HDLSS tabular data and opens avenues for adapting other foundation models to extreme feature counts, with practical impact for biomedical discovery and precision medicine.
Abstract
Revealing novel insights from the relationship between molecular measurements and pathology remains a high-impact application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional machine learning approaches. Although prior-data fitted networks are emerging as foundation models for tabular data, they are currently not suited to large feature counts (>500). Feature reduction enables their application, but it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance while exhibiting improved robustness to noise. It scales seamlessly beyond 50,000 features, regardless of noise level, while maintaining inherent interpretability, which is critical for biomedical applications. Our results show that prior-informed adaptation is a suitable way to enhance the capability of foundation models for high-dimensional data. On real-world biomedical datasets, many of the most relevant features identified by the model overlap with previous biological findings, while others suggest potential starting points for future studies.
