Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Xumeng Wen, Shun Zheng, Zhen Xu, Yiming Sun, Jiang Bian

TL;DR

This work tackles scalable TabICL by introducing a retrieval-augmented approach (TabRAG) that decouples context retrieval from generation. A universal, non-parametric TabRAG module selects relevant in-context tabular instances, while a post-trained Phi-3 LLM (Phi3-GTL) makes predictions over the extended context; the two are aligned through retrieval-guided training. Across 69 held-out datasets, TabRAG improves LLM-based TabICL and yields distinct decision boundaries and greater ensemble diversity, though it generally lags behind the best tuned numeric models. The results underscore the promise of language as a universal interface for scalable tabular learning and highlight retrieval engineering as a key direction for future gains. Combining large-scale retrieval with longer contexts lets the LLM draw on far more training data than few-shot prompting allows.

Abstract

Recent studies have shown that large language models (LLMs), when customized with post-training on tabular data, can acquire general tabular in-context learning (TabICL) capabilities. These models are able to transfer effectively across diverse data schemas and different task domains. However, existing LLM-based TabICL approaches are constrained to few-shot scenarios due to the sequence length limitations of LLMs, as tabular instances represented in plain text consume substantial tokens. To address this limitation and enable scalable TabICL for any data size, we propose retrieval-augmented LLMs tailored to tabular data. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets and demonstrating promising scaling behavior. Extensive comparisons with state-of-the-art tabular models reveal that, while LLM-based TabICL still lags behind well-tuned numeric models in overall performance, it uncovers powerful algorithms under limited contexts, enhances ensemble diversity, and excels on specific datasets. These unique properties underscore the potential of language as a universal and accessible interface for scalable tabular data learning.
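
To make the idea of retrieving in-context instances and representing them as plain text more concrete, here is a minimal sketch. It is not the paper's TabRAG module: the nearest-neighbour rule on z-scored features, the "name is value" text template, and the function names `retrieve_context`, `serialize_row`, and `build_prompt` are all assumptions chosen for illustration.

```python
import numpy as np

def retrieve_context(X_train, y_train, x_query, k=3):
    """Pick the k training rows closest to the query.

    Assumption: Euclidean distance on z-scored features stands in for
    whatever retrieval policy the paper actually uses.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    Z = (X_train - mean) / std
    zq = (x_query - mean) / std
    dist = np.linalg.norm(Z - zq, axis=1)
    idx = np.argsort(dist, kind="stable")[:k]
    return X_train[idx], y_train[idx]

def serialize_row(row, feature_names, label=None):
    """Render one tabular instance as text (hypothetical template)."""
    text = ", ".join(f"{name} is {value:g}" for name, value in zip(feature_names, row))
    if label is None:
        return f"{text} -> label:"  # the query row: label left for the LLM
    return f"{text} -> label: {label}"

def build_prompt(X_train, y_train, x_query, feature_names, k=3):
    """Concatenate k retrieved examples followed by the unlabeled query."""
    X_ctx, y_ctx = retrieve_context(X_train, y_train, x_query, k)
    lines = [serialize_row(r, feature_names, l) for r, l in zip(X_ctx, y_ctx)]
    lines.append(serialize_row(x_query, feature_names))
    return "\n".join(lines)

if __name__ == "__main__":
    # Toy data, invented for this example only.
    X = np.array([[25, 30000], [47, 82000], [52, 91000], [23, 28000], [31, 45000]], dtype=float)
    y = np.array([0, 1, 1, 0, 0])
    print(build_prompt(X, y, np.array([50.0, 88000.0]), ["age", "income"], k=2))
```

Because each serialized row costs a roughly fixed number of tokens, the prompt length grows with `k` rather than with the size of the training table, which is the property that lets retrieval sidestep the sequence-length limit described above.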

Paper Structure

This paper contains 42 sections, 3 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: We investigate the effects of increasing the number of training instances ($|D_{\text{train}}^{T'}|$) and the number of in-context instances per test example ($N^C$) on the TabICL performance of Phi3-GTL models. In each subplot, we compare the scaling effects of two Phi3-GTL models with different retrieval policies: one that randomly selects in-context instances, denoted as "Random," and the other employing our default TabRAG module, denoted as "RAG". We use violin plots to visualize the performance distribution across multiple held-out datasets. Additionally, dashed lines are used to emphasize that the median prediction error of our approach follows a power-law relationship with the number of training instances.
  • Figure 2: An overall performance comparison of all models. In the left subplot, we use violin plots to show the AUROC scores of different models across 29 classification tasks, while the right subplot displays the NMAE scores for 40 regression tasks. Models are sorted by their median metric score across the held-out datasets, with dashed lines indicating these median scores in each subplot. Our approach, RAG+Phi3-GTL, is marked with an asterisk (*) for quick identification.
  • Figure 3: Ensemble performance comparisons of RAG+Phi3-GTL, TabPFN-v2, LightGBM, and CatBoost. Normalized AUROC or NMAE scores (min-max normalized across methods for each dataset) are plotted to highlight the relative strengths of each method across multiple datasets; absolute metric differences are omitted.
  • Figure 4: Per-dataset performance comparisons between RAG+Phi3-GTL and the two most competitive baselines, TabPFN-v2 and CatBoost, are presented, with dataset IDs sorted by performance gaps. Dashed lines and annotations are used to indicate the proportion of datasets where RAG+Phi3-GTL outperforms these baselines and where it significantly lags behind.
  • Figure 5: Decision boundary comparisons of various models, where each row corresponds to a specific set of training instances generated from a given data distribution. The first column visualizes these training instances, while the subsequent columns illustrate the decision boundaries of different models. The top two rows represent the same data distribution but with varying numbers of training instances, whereas the bottom two rows depict a different data distribution.
  • ...and 11 more figures
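
Two of the captions above describe simple computations: the power-law relationship between median prediction error and the number of training instances (Figure 1) and the per-dataset min-max normalization across methods (Figure 3). The sketch below shows one way each could be computed. The function names are hypothetical, the log-log least-squares fit is an assumed fitting procedure that the captions do not specify, and the numbers in the demo are made up rather than taken from the paper.

```python
import numpy as np

def fit_power_law(n_train, error):
    """Fit error ~ a * n_train**b by least squares in log-log space.

    Assumption: an ordinary linear fit on logs; a decreasing error
    curve gives a negative exponent b.
    """
    slope, intercept = np.polyfit(np.log(n_train), np.log(error), 1)
    return float(np.exp(intercept)), float(slope)

def minmax_across_methods(scores):
    """Rescale each dataset's scores to [0, 1] across methods.

    scores has shape (n_datasets, n_methods). Rows where every method
    ties are mapped to 0 to avoid dividing by zero.
    """
    scores = np.asarray(scores, dtype=float)
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    span = hi - lo
    safe_span = np.where(span == 0, 1.0, span)
    return (scores - lo) / safe_span

if __name__ == "__main__":
    # Synthetic values for illustration only.
    n = np.array([10.0, 100.0, 1000.0])
    a, b = fit_power_law(n, 2.0 * n ** -0.5)
    print(f"a={a:.3f}, b={b:.3f}")
    print(minmax_across_methods([[0.80, 0.90, 0.85], [0.50, 0.50, 0.50]]))
```

Normalizing within each dataset removes differences in task difficulty, so the plot shows which method ranks where, at the cost of hiding how large the absolute gaps are.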