In-Context Data Distillation with TabPFN

Junwei Ma; Valentin Thomas; Guangwei Yu; Anthony Caterini

TL;DR

This paper introduces In-Context Distillation (ICD), a method that overcomes TabPFN's $\mathcal{O}(n^2)$ memory barrier, where $n$ is the number of context (training) points, by learning a compact distilled context $\mathcal{D}_{\text{dist}}$ that preserves the predictive information of a large tabular dataset. ICD optimizes the context by backpropagating through the context tokens while the model weights stay fixed, so TabPFN runs within a fixed memory budget while maintaining strong performance. On 48 large OpenML datasets, TabPFN-ICD significantly outperforms TabPFN and default XGBoost and approaches the performance of tuned XGBoost. The approach broadens the practical applicability of TabPFN to real-world, large-scale tabular data, with potential implications for faster inference and scalable meta-learning on structured data.
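To give a sense of scale for the quadratic memory barrier, the short sketch below estimates the size of one dense attention score matrix over $n$ context points. It assumes a single $n \times n$ matrix of 4-byte (float32) values for one head and one layer; the helper name `attention_matrix_bytes` is hypothetical, and real implementations may use different precisions or memory-saving attention kernels.

```python
def attention_matrix_bytes(n_context: int, bytes_per_value: int = 4) -> int:
    """Bytes for one dense n x n attention score matrix (assumed float32)."""
    return n_context * n_context * bytes_per_value


if __name__ == "__main__":
    # Memory grows with the square of the context size.
    for n in (1_000, 10_000, 100_000):
        gib = attention_matrix_bytes(n) / 2**30
        print(f"n={n:>7,d}: {gib:10.3f} GiB")
```

Doubling the context size quadruples this estimate, which is why a small, fixed-size distilled context keeps memory constant regardless of how large the original training set is.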

Abstract

Foundation models have revolutionized tasks in computer vision and natural language processing. However, in the realm of tabular data, tree-based models like XGBoost continue to dominate. TabPFN, a transformer model tailored for tabular data, mirrors recent foundation models in its exceptional in-context learning capability, being competitive with XGBoost's performance without the need for task-specific training or hyperparameter tuning. Despite its promise, TabPFN's applicability is hindered by its data size constraint, limiting its use in real-world scenarios. To address this, we present in-context data distillation (ICD), a novel methodology that effectively eliminates these constraints by optimizing TabPFN's context. ICD efficiently enables TabPFN to handle significantly larger datasets with a fixed memory budget, improving TabPFN's quadratic memory complexity but at the cost of a linear number of tuning steps. Notably, TabPFN, enhanced with ICD, demonstrates very strong performance against established tree-based models and modern deep learning methods on 48 large tabular datasets from OpenML.
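The idea of optimizing a context rather than model weights can be illustrated with a toy example. The sketch below is not TabPFN and not the authors' implementation. It substitutes a simple softmax-weighted nearest-neighbour vote over 1D context points for the transformer, and it computes gradients by hand with the chain rule instead of using automatic differentiation. All names (`predict_proba`, `nll_and_grad`, `distill`), the temperature `tau`, the learning rate, and the synthetic data are assumptions made for illustration. Only the positions of the distilled points are updated; their labels and the "model" stay fixed, and the training set is visited in mini-batches so the context size never grows.

```python
import math
import random


def predict_proba(x, ctx_x, ctx_y, tau=1.0):
    """Stand-in in-context classifier: softmax-weighted vote of context labels."""
    scores = [-(x - c) ** 2 / tau for c in ctx_x]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    p1 = sum(w * y for w, y in zip(weights, ctx_y))  # P(class 1 | x, context)
    return p1, weights


def nll_and_grad(batch, ctx_x, ctx_y, tau=1.0, eps=1e-6):
    """Mean negative log-likelihood of a batch and its gradient w.r.t. ctx_x."""
    loss = 0.0
    grad = [0.0] * len(ctx_x)
    for x, y in batch:
        p, w = predict_proba(x, ctx_x, ctx_y, tau)
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logs finite
        loss -= math.log(p) if y == 1 else math.log(1.0 - p)
        dloss_dp = (p - y) / (p * (1.0 - p))
        for j, (c, label) in enumerate(zip(ctx_x, ctx_y)):
            # Chain rule through the softmax weights and squared distance.
            dp_dc = w[j] * (label - p) * 2.0 * (x - c) / tau
            grad[j] += dloss_dp * dp_dc
    n = len(batch)
    return loss / n, [g / n for g in grad]


def distill(train, ctx_x, ctx_y, steps=300, batch_size=16, lr=0.1, seed=0):
    """Single optimization loop over the context; the model itself is untouched."""
    rng = random.Random(seed)
    ctx_x = list(ctx_x)
    for _ in range(steps):
        batch = rng.sample(train, batch_size)  # fixed-size mini-batch
        _, grad = nll_and_grad(batch, ctx_x, ctx_y)
        ctx_x = [c - lr * g for c, g in zip(ctx_x, grad)]
    return ctx_x


# Synthetic 1D data: class 0 centred at -2, class 1 centred at +2.
data_rng = random.Random(0)
train = [(data_rng.gauss(-2.0, 0.5), 0) for _ in range(100)]
train += [(data_rng.gauss(2.0, 0.5), 1) for _ in range(100)]

# Deliberately poor initial context: two points per class, all near zero.
init_x = [-0.2, -0.1, 0.1, 0.2]
ctx_y = [0, 0, 1, 1]

loss_before, _ = nll_and_grad(train, init_x, ctx_y)
dist_x = distill(train, init_x, ctx_y)
loss_after, _ = nll_and_grad(train, dist_x, ctx_y)

if __name__ == "__main__":
    print(f"NLL before: {loss_before:.4f}, after: {loss_after:.4f}")
    print("distilled context:", [round(c, 2) for c in dist_x])
```

With a real transformer the hand-written gradient would be replaced by automatic differentiation through the frozen network, but the structure stays the same: one loop, a fixed-size context, and a number of update steps that scales with how much training data is visited.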

Paper Structure (12 sections, 2 equations, 3 figures, 5 tables)

Introduction
Background and related work
In-Context Distillation
Comparison with dataset distillation (Wang et al., 2018)
Experiments
Experimental Setup
Results
Discussion and conclusion
Appendix
Dataset Details
Baseline Details
Additional Results

Figures (3)

Figure 1: Evolution of the distilled datapoints (circles with white borders, 8 per class) and the resulting decision boundary on a simple 2D two-moons dataset. The black line is the decision boundary $p(y|x, \mathcal{D}_{\text{dist}})=0.5$ of TabPFN conditioned on the 16 distilled points. The shaded red and blue regions show which class the classifier assigns to points in each region. The circles with black borders are the test points. The distilled points are initialized at random training points and move over time to improve the classifier.
Figure 2: (a) Comparison between traditional and in-context data distillation. Left: Traditional data distillation methods usually consist of nested optimization loops, where an inner loop optimizes the model parameters $\theta$ for $T$ steps and an outer loop differentiates through that inner optimization to update $X_{\text{dist}}$ for $K$ steps. Right: In-context data distillation requires only a single optimization loop over $X_{\text{dist}}$ for $K$ steps, eliminating the need to differentiate through an inner optimization process. (b) Detailed architecture diagram of ICD applied to TabPFN. Dashed lines indicate gradient flow. Blue and red arrows represent attention within TabPFN.
Figure 3: Log change in median AUC of TabPFN-ICD and TabPFN relative to XGBoost, as a function of training set size.
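
The contrast drawn in Figure 2(a) can be written as two objectives. This is a sketch in assumed notation: the loss $\mathcal{L}$, step size $\eta$, and iterates $\theta_t$ are not taken from the caption, and the paper's own equations may differ in detail. Traditional dataset distillation is commonly posed as a bilevel problem, in which the outer objective depends on parameters obtained by $T$ inner training steps on the distilled data:

$$\min_{X_{\text{dist}}} \; \mathcal{L}\big(\theta_T(X_{\text{dist}});\, \mathcal{D}_{\text{train}}\big), \qquad \theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t;\, X_{\text{dist}}), \quad t = 0, \dots, T-1.$$

In-context distillation, as described in the abstract, keeps $\theta$ frozen and optimizes only the context, so the inner recursion disappears:

$$\min_{\mathcal{D}_{\text{dist}}} \; -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{train}}} \big[ \log p_\theta(y \mid x, \mathcal{D}_{\text{dist}}) \big].$$

Each of the $K$ update steps then needs a single forward and backward pass through the frozen model, with no gradients taken through an unrolled training trajectory.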
