Valid Feature-Level Inference for Tabular Foundation Models via the Conditional Randomization Test

Mohamed Salem

TL;DR

This article presents a practical approach to feature-level hypothesis testing that combines the Conditional Randomization Test (CRT) with TabPFN, a probabilistic foundation model for tabular data, and yields finite-sample valid p-values for conditional feature relevance, even in nonlinear and correlated settings.

Abstract

Modern machine learning models are highly expressive but notoriously difficult to analyze statistically. In particular, while black-box predictors can achieve strong empirical performance, they rarely provide valid hypothesis tests or p-values for assessing whether individual features contain information about a target variable. This article presents a practical approach to feature-level hypothesis testing that combines the Conditional Randomization Test (CRT) with TabPFN, a probabilistic foundation model for tabular data. The resulting procedure yields finite-sample valid p-values for conditional feature relevance, even in nonlinear and correlated settings, without requiring model retraining or parametric assumptions.
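
To make the testing procedure in the abstract concrete, the following is a minimal sketch of a generic Conditional Randomization Test. It is not the authors' implementation. In place of a TabPFN-based test statistic it uses the absolute ordinary least squares coefficient as a stand-in. It also assumes the conditional distribution of the tested feature given the others is known, here independent standard normal features. The names `crt_pvalue`, `ols_abs_coef`, and `sample_independent_normal` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ols_abs_coef(X, y, j):
    """Stand-in test statistic: |OLS coefficient| of feature j (not TabPFN-based)."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return abs(coef[j])


def sample_independent_normal(X, j, rng):
    """Assumed conditional sampler: X_j | X_{-j} ~ N(0, 1), i.e. independent features."""
    return rng.standard_normal(X.shape[0])


def crt_pvalue(X, y, j, sampler, statistic, n_draws=200, seed=None):
    """CRT p-value for H0: feature j is independent of y given the other features."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(X, y, j)
    exceed = 0
    for _ in range(n_draws):
        X_tilde = X.copy()
        # Replace column j with a fresh draw from its conditional distribution.
        X_tilde[:, j] = sampler(X, j, rng)
        if statistic(X_tilde, y, j) >= t_obs:
            exceed += 1
    # The +1 terms count the observed statistic among the draws, which is
    # what gives finite-sample validity when the sampler is exact.
    return (1 + exceed) / (1 + n_draws)


# Toy data: only feature 0 carries signal; features 1 and 2 are null.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = 2.0 * X[:, 0] + rng.standard_normal(200)

p_relevant = crt_pvalue(X, y, 0, sample_independent_normal, ols_abs_coef, seed=1)
p_null = crt_pvalue(X, y, 2, sample_independent_normal, ols_abs_coef, seed=2)
print(p_relevant, p_null)
```

The p-value's validity depends on the conditional sampler, not on the quality of the statistic, which affects only power. This is why a flexible predictor can be used as the statistic without parametric assumptions on the relationship between the features and the target.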

Paper Structure (27 sections, 19 equations, 2 figures, 1 table)

Introduction
Related Work
Contributions
Problem Formulation
Methodology
Simulation and Ablation Studies
Synthetic Data Generating Processes
Linear Regimes
Linear Sparse.
Linear Dense.
Weak Signal.
Noise Block.
Correlated Features.
Nonlinear Regimes
Friedman 1.
...and 12 more sections

Figures (2)

Figure 1: Empirical cumulative distribution functions of CRT p-values for conditionally relevant and irrelevant features. Null p-values closely follow the Uniform$(0,1)$ distribution, while relevant features exhibit strong concentration near zero, indicating both valid calibration and high power.
Figure 2: Quantile–quantile plot of empirical null p-values versus the Uniform$(0,1)$ distribution. The close alignment with the diagonal is consistent with finite-sample calibration in these runs of the TabPFN-based Conditional Randomization Test across heterogeneous datasets.
