In-Context Data Distillation with TabPFN
Junwei Ma, Valentin Thomas, Guangwei Yu, Anthony Caterini
TL;DR
This paper introduces In-Context Distillation (ICD), a method that overcomes TabPFN's $\mathcal{O}(n^2)$ memory barrier by learning a compact distilled context $\mathcal{D}_{dist}$ that preserves the predictive information of a large tabular dataset. ICD optimizes the context by backpropagating through the context tokens, without retraining the model, so TabPFN can operate with fixed memory while maintaining strong performance. On 48 large OpenML datasets, TabPFN-ICD significantly outperforms TabPFN and default XGBoost baselines and approaches the performance of tuned XGBoost, illustrating the efficacy of context-based data distillation for in-context learning in tabular domains. The approach thus broadens the practical applicability of TabPFN to real-world, large-scale tabular data, with potential implications for faster inference and scalable meta-learning on structured data.
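The idea of tuning a small context while the predictor itself stays fixed can be illustrated with a toy sketch. It is not the authors' implementation: a simple softmax-attention classifier stands in for TabPFN's frozen forward pass, the gradient is derived by hand rather than obtained by backpropagation through a transformer, and the names `predict_proba`, `loss_and_grad`, and `distill` are hypothetical. Keeping the distilled labels fixed at an equal number of rows per class and updating only the distilled features is also an assumption made for illustration.

```python
import math
import random

def predict_proba(x, ctx_x, ctx_y, n_classes, tau=1.0):
    """Toy in-context predictor (stand-in for a frozen model):
    softmax attention over context rows, summed per context label."""
    scores = [-sum((a - b) ** 2 for a, b in zip(x, c)) / tau for c in ctx_x]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    probs = [0.0] * n_classes
    for w, label in zip(weights, ctx_y):
        probs[label] += w
    return probs, weights

def loss_and_grad(data_x, data_y, ctx_x, ctx_y, n_classes, tau=1.0):
    """Mean negative log-likelihood of the training data given the context,
    and its gradient with respect to the context features only."""
    grad = [[0.0] * len(c) for c in ctx_x]
    loss = 0.0
    for x, y in zip(data_x, data_y):
        probs, weights = predict_proba(x, ctx_x, ctx_y, n_classes, tau)
        p = max(probs[y], 1e-12)  # guard against log(0)
        loss -= math.log(p)
        for j, (c, label) in enumerate(zip(ctx_x, ctx_y)):
            # d(-log p) / d(score_j) = -w_j * (1[label == y] - p) / p
            d_score = -weights[j] * ((1.0 if label == y else 0.0) - p) / p
            for d in range(len(c)):
                # d(score_j) / d(c_jd) = 2 * (x_d - c_jd) / tau
                grad[j][d] += d_score * 2.0 * (x[d] - c[d]) / tau
    n = len(data_x)
    return loss / n, [[g / n for g in row] for row in grad]

def distill(data_x, data_y, n_per_class, n_classes, steps=200, lr=0.05, seed=0):
    """Gradient descent on the distilled features; labels stay fixed (assumption)."""
    rng = random.Random(seed)
    dim = len(data_x[0])
    ctx_y = [k for k in range(n_classes) for _ in range(n_per_class)]
    ctx_x = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in ctx_y]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(data_x, data_y, ctx_x, ctx_y, n_classes)
        history.append(loss)
        ctx_x = [[c - lr * g for c, g in zip(row, g_row)]
                 for row, g_row in zip(ctx_x, grad)]
    final_loss, _ = loss_and_grad(data_x, data_y, ctx_x, ctx_y, n_classes)
    history.append(final_loss)
    return ctx_x, ctx_y, history

# Synthetic two-class data: 200 rows, two features.
data_rng = random.Random(42)
data_x, data_y = [], []
for i in range(200):
    y = i % 2
    center = -2.0 if y == 0 else 2.0
    data_x.append([data_rng.gauss(center, 1.0), data_rng.gauss(0.0, 1.0)])
    data_y.append(y)

ctx_x, ctx_y, history = distill(data_x, data_y, n_per_class=2, n_classes=2)
print(f"NLL before: {history[0]:.3f}, after: {history[-1]:.3f}")
print(f"context rows: {len(ctx_x)}, training rows: {len(data_x)}")
```

In this sketch the context stays at four rows however many training rows are used to fit it, which mirrors the fixed-memory property described above. A real setup would presumably replace the hand-derived gradient with automatic differentiation through the pretrained model.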
Abstract
Foundation models have revolutionized tasks in computer vision and natural language processing. However, in the realm of tabular data, tree-based models like XGBoost continue to dominate. TabPFN, a transformer model tailored for tabular data, mirrors recent foundation models in its exceptional in-context learning capability: it is competitive with XGBoost without the need for task-specific training or hyperparameter tuning. Despite its promise, TabPFN's applicability is hindered by a constraint on data size, which limits its use in real-world scenarios. To address this, we present in-context data distillation (ICD), a novel methodology that effectively eliminates this constraint by optimizing TabPFN's context. ICD enables TabPFN to handle significantly larger datasets with a fixed memory budget, trading TabPFN's quadratic memory complexity for a linear number of tuning steps. Notably, TabPFN enhanced with ICD demonstrates very strong performance against established tree-based models and modern deep learning methods on 48 large tabular datasets from OpenML.
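To give a sense of why quadratic memory becomes a barrier, the short calculation below counts the bytes needed for one dense $n \times n$ attention score matrix. The 4 bytes per value (float32), the single matrix, and the helper name `attention_matrix_bytes` are assumptions for illustration; the actual footprint of TabPFN depends on its architecture and attention implementation, which this section does not specify.

```python
def attention_matrix_bytes(n_rows, bytes_per_value=4):
    """Bytes for one dense n_rows x n_rows attention score matrix
    (float32 by default; an illustrative assumption)."""
    return n_rows * n_rows * bytes_per_value

# Memory grows with the square of the number of context rows.
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: {attention_matrix_bytes(n) / 1e9:8.3f} GB")

# With a distilled context of fixed size, the cost no longer depends
# on how many training rows the original dataset has.
distilled_rows = 1_000  # hypothetical distilled context size
print(f"distilled: {attention_matrix_bytes(distilled_rows) / 1e9:8.3f} GB")
```

Under these assumptions, ten times more context rows means one hundred times more memory for the score matrix, whereas a fixed-size distilled context keeps the cost constant.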
