Retrieval & Fine-Tuning for In-Context Tabular Models
Valentin Thomas, Junwei Ma, Rasa Hosseinzadeh, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, Anthony Caterini
TL;DR
This work addresses the scalability gap of transformer-based in-context learning (ICL) for tabular data by introducing LoCalPFN, which builds on the TabPFN base model by combining kNN retrieval of local neighbours with end-to-end fine-tuning on the retrieved samples. The approach achieves state-of-the-art performance across 95 TabZilla/OpenML datasets, outperforming both neural baselines and tuned tree-based methods, especially on larger and more complex tasks. By demonstrating that local context, together with joint retrieval and fine-tuning, can substantially improve tabular ICL, the paper advances practical deep learning for tabular domains. The findings suggest that retrieval-augmented, locally calibrated transformers can scale tabular deep learning while retaining strong performance, and they point to new directions for future tabular foundation models.
Abstract
Tabular data is a pervasive modality spanning a wide range of domains, and its inherent diversity poses a considerable challenge for deep learning. Recent advances using transformer-based in-context learning have shown promise on smaller and less complex datasets, but have struggled to scale to larger and more complex ones. To address this limitation, we propose a combination of retrieval and fine-tuning: we can adapt the transformer to a local subset of the data by collecting nearest neighbours, and then perform task-specific fine-tuning with this retrieved set of neighbours in context. Using TabPFN as the base model -- currently the best tabular in-context learner -- and applying our retrieval and fine-tuning scheme on top results in what we call a locally-calibrated PFN, or LoCalPFN. We conduct extensive evaluation on 95 datasets curated by TabZilla from OpenML, on which we establish a new state-of-the-art with LoCalPFN -- even with respect to tuned tree-based models. Notably, we show a significant boost in performance compared to the base in-context model, demonstrating the efficacy of our approach and advancing the frontier of deep learning in tabular data.
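As a rough illustration of the retrieval step described in the abstract, the following Python sketch builds a local context for a single query point from its k nearest training rows. It is not the authors' implementation: the function names, the Euclidean distance, and the value of k are assumptions made for this example, and `placeholder_in_context_predict` is a hypothetical stand-in that only returns class frequencies where the real method would run TabPFN on the retrieved context. The fine-tuning stage is not shown.

```python
import numpy as np


def retrieve_local_context(X_train, y_train, x_query, k=5):
    """Return the k training rows closest to x_query (Euclidean distance).

    Assumption: plain Euclidean distance on raw features; the paper's
    actual distance and preprocessing are not specified in the abstract.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]
    return X_train[idx], y_train[idx]


def placeholder_in_context_predict(X_ctx, y_ctx, x_query, n_classes):
    """Hypothetical stand-in for the in-context model.

    A real pipeline would pass (X_ctx, y_ctx) as context to the
    transformer; here we simply return label frequencies in the context.
    """
    counts = np.bincount(y_ctx, minlength=n_classes)
    return counts / counts.sum()


# Toy data: two well-separated Gaussian clusters, one per class.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 2)),
    rng.normal(2.0, 0.5, size=(50, 2)),
])
y_train = np.array([0] * 50 + [1] * 50)

x_query = np.array([2.0, 2.0])
X_ctx, y_ctx = retrieve_local_context(X_train, y_train, x_query, k=5)
probs = placeholder_in_context_predict(X_ctx, y_ctx, x_query, n_classes=2)
print(probs)
```

The point of the sketch is that each query sees only a small, local context rather than the full training set, which is how the abstract describes adapting the transformer to a local subset of the data.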
