LLM Embeddings Improve Test-time Adaptation to Tabular $Y|X$-Shifts

Yibo Zeng, Jiashuo Liu, Henry Lam, Hongseok Namkoong

TL;DR

This paper addresses $Y|X$-shifts in tabular data by serializing each record into text and using LLM embeddings of that text as informative representations for test-time adaptation with limited target labels. It combines these embeddings with optional domain information and trains a shallow neural network on top, evaluating multiple target-adaptation strategies (in-context domain information, full finetuning, LoRA, and prefix tuning) across three real-world datasets and thousands of configurations. The key findings are that LLM embeddings alone provide inconsistent robustness gains, that finetuning with as few as 32 target samples yields substantial improvements, especially under stronger $Y|X$-shifts, and that the effectiveness of domain information and sample allocation is dataset-dependent. Overall, the work demonstrates a practical, data-efficient path to improving tabular predictions under distribution shift and offers theoretical insights linking LLM-based representations to reduced target risk in domain adaptation.

Abstract

For tabular datasets, changes in the relationship between the label and the covariates ($Y|X$-shifts) are common due to missing variables (a.k.a. confounders). Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate $Y|X$-shifts, and propose to leverage the prior world knowledge in LLMs by serializing (writing down) the tabular data as text and encoding it. We find that LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be adapted (finetuned) well to the target domain using as few as 32 labeled observations. Our findings are based on a comprehensive and systematic study that covers 7650 source-target pairs and benchmarks against 261,000 model configurations trained by 22 algorithms. Our observations hold when we ablate the amount of accessible target data and the adaptation strategy. The code is available at https://github.com/namkoong-lab/LLM-Tabular-Shifts.
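To make the serialization step concrete, here is a minimal sketch of writing one tabular record down as text, with optional domain information prepended. The function name `serialize_row`, the "The <feature> is <value>." template, and the example feature names are assumptions for illustration; the paper's actual template may differ, and the subsequent embedding step (passing the text through an LLM encoder) is not shown.

```python
from typing import Mapping, Optional


def serialize_row(row: Mapping[str, object], domain_info: Optional[str] = None) -> str:
    """Write one tabular record as a short sequence of English sentences.

    The sentence template is a hypothetical choice for illustration,
    not necessarily the one used in the paper.
    """
    # One sentence per feature, in the order the features are given.
    parts = [f"The {name} is {value}." for name, value in row.items()]
    text = " ".join(parts)
    # Optionally prepend free-text domain information (e.g., a description
    # of the target domain) so that it is encoded together with the record.
    if domain_info:
        text = f"{domain_info.strip()} {text}"
    return text


# Hypothetical record; the feature names are made up for this example.
row = {"age": 39, "occupation": "teacher", "hours worked per week": 40}
print(serialize_row(row))
print(serialize_row(row, domain_info="This person lives in California."))
```

The resulting string would then be fed to an LLM encoder, and the embedding used as the input features of the shallow neural network.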

Paper Structure

This paper contains 35 sections, 1 theorem, 8 equations, 9 figures, 4 tables.

Key Result

Proposition 1

For any $\delta\in(0,1)$, with probability at least $1-\delta$, where $d_{\mathcal{H}\Delta\mathcal{H}}(\cdot,\cdot)$ denotes the $\mathcal{H}\Delta\mathcal{H}$-distance between two (marginal) distributions.
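For reference, the $\mathcal{H}\Delta\mathcal{H}$-distance is commonly defined as in the classical domain adaptation theory of Ben-David et al. The form below assumes the paper follows that standard definition; $\mathcal{H}$ is a hypothesis class, and $\mathcal{D}_S$ and $\mathcal{D}_T$ are placeholder names for the source and target marginal distributions.

$$
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)
= 2\sup_{h,h'\in\mathcal{H}}
\left|\,\Pr_{x\sim\mathcal{D}_S}\big[h(x)\neq h'(x)\big]
-\Pr_{x\sim\mathcal{D}_T}\big[h(x)\neq h'(x)\big]\right|
$$

Intuitively, the distance is small when no pair of hypotheses in $\mathcal{H}$ disagrees much more often on one marginal than on the other.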

Figures (9)

  • Figure 1: Overview of methods incorporating LLM embeddings.
  • Figure 2: The FractionBest Ratio in \ref{equ:optimal-ratio} (with $\Delta=1\%$). We compare our proposed methods, LLM$|$NN in (a)-(c) and LLM$|$NN (finetuning) in (d)-(f), with methods on Tabular features.
  • Figure 3: Shift pattern analysis. For the 2550 source$\rightarrow$target distribution shift pairs in the ACS Income dataset, we decompose the performance drop for each source$\rightarrow$target pair into a part due to $Y|X$-shifts (red curve) and a part due to $X$-shifts (blue curve), and sort all pairs according to the drop introduced by $Y|X$-shifts. We plot the worst-500 settings in each dataset; the decomposition method used here is DISDE \cite{CaiNaYa23} with XGBoost as the reference model. Results on other datasets are in \ref{fig-appendix:overall_decomposition}.
  • Figure 4: Average Macro F1 Score over the worst-500 settings. For each dataset, we sort the 2550 settings according to the magnitude of $Y|X$-shifts and select the worst-500 settings. We calculate the average Macro F1 Score for each method. For all methods, we select the best hyper-parameters of the basic model according to 32 samples from the target domain. We use CVaR-DRO based on NN here to represent DRO methods. For finetuning methods, we use an additional 32 target samples for finetuning; recall Section \ref{sec: testbed setup}.
  • Figure 5: Average Macro F1 Score over the worst-500 settings for different numbers of target samples used for finetuning. Dotted lines represent methods that do not require finetuning, while solid lines represent finetuning methods. All three figures share the same legend, and all results use 32 labeled target samples as the validation set.
  • ...and 4 more figures
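The "worst-500" aggregation described in the Figure 4 caption can be sketched as follows: compute a Macro F1 score per source$\rightarrow$target setting, sort settings by the performance drop attributed to $Y|X$-shifts, and average the score over the top-$k$. This is a minimal sketch under stated assumptions, not the paper's evaluation code; the function names `macro_f1` and `worst_k_average` and the input layout (one shift magnitude and one score per setting) are hypothetical.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def worst_k_average(shift_drop: Sequence[float], score: Sequence[float], k: int = 500) -> float:
    """Average `score` over the k settings with the largest Y|X-shift drop."""
    if len(shift_drop) != len(score) or len(score) == 0 or k <= 0:
        raise ValueError("need equally long, non-empty inputs and k > 0")
    # Indices of settings sorted from largest to smallest Y|X-shift drop.
    order = sorted(range(len(shift_drop)), key=lambda i: shift_drop[i], reverse=True)
    worst = order[:k]
    return sum(score[i] for i in worst) / len(worst)


# Toy usage with made-up numbers: three settings, keep the two worst.
print(worst_k_average([0.1, 0.5, 0.3], [0.9, 0.4, 0.6], k=2))
```

In this sketch, `shift_drop` stands in for the per-setting $Y|X$-shift contribution produced by a decomposition such as DISDE, which is computed elsewhere.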

Theorems & Definitions (1)

  • Proposition 1