Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction

Jie Li, Andrew McCarthy, Zhizhuo Zhang, Stephen Young

TL;DR

This paper addresses predicting biomolecule efficacy from small, heterogeneous datasets by leveraging TabPFN-based in-context learning and introducing OligoICP, an uncertainty-guided, label-free strategy for post-hoc model selection. Using a 574-feature representation that combines sequence encodings, trimer counts, and thermodynamic features, TabPFN outperforms a state-of-the-art model on the Huesken siRNA dataset and generalizes to novel targets in few-shot settings. A key finding is that the model's inter-quantile range (IQR) is well-calibrated and negatively correlated with actual error, which enables model selection without ground-truth labels. The OligoICP ensemble further improves correlation across targets by selecting the models with the lowest mean IQR. This approach offers a practical, scalable path to more reliable biomolecule efficacy predictions in real-world settings where ground truth is scarce or unavailable. Key results demonstrate improved correlation on multiple targets and show that uncertainty-based ensemble selection can provide a meaningful upper bound toward oracle performance, with potential applicability across diverse biomolecule prediction tasks.

Abstract

In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, making strategies like post-hoc ensembling of models trained on different data subsets a viable approach. An open question is how to select the best models for the ensemble without access to ground truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using straightforward sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model's predicted inter-quantile range (IQR), a measure of its uncertainty, has a negative correlation with true prediction error. We developed the OligoICP method, which selects and averages an ensemble of models with the lowest mean IQR for siRNA efficacy prediction, achieving superior performance compared to naive ensembling or using a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.
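
The selection step described in the abstract (rank candidate models by mean IQR on the unlabeled test set, keep the lowest, average their predictions) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout, the choice of `k`, and the toy numbers are all assumptions made for illustration, and the lower and upper quantiles are assumed to have been produced by each model beforehand.

```python
from statistics import fmean


def mean_iqr(lower, upper):
    """Mean inter-quantile range of one model over the unlabeled test set."""
    return fmean(u - l for l, u in zip(lower, upper))


def select_and_average(models, k):
    """Keep the k models with the lowest mean IQR and average their predictions.

    `models` maps a (hypothetical) model name to a dict holding per-sample
    "pred", "lower", and "upper" lists. No ground-truth labels are used.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(
        models,
        key=lambda name: mean_iqr(models[name]["lower"], models[name]["upper"]),
    )
    chosen = ranked[:k]
    n_samples = len(models[chosen[0]]["pred"])
    ensemble = [
        fmean(models[name]["pred"][i] for name in chosen) for i in range(n_samples)
    ]
    return chosen, ensemble


# Toy numbers, invented purely for illustration.
models = {
    "subset_a": {"pred": [0.2, 0.8], "lower": [0.1, 0.7], "upper": [0.3, 0.9]},
    "subset_b": {"pred": [0.4, 0.6], "lower": [0.0, 0.2], "upper": [0.8, 1.0]},
    "subset_c": {"pred": [0.3, 0.7], "lower": [0.2, 0.5], "upper": [0.5, 0.8]},
}
chosen, ensemble = select_and_average(models, k=2)
print(chosen, ensemble)
```

In this toy case the widest model (`subset_b`) is dropped and the remaining two are averaged. How many models to keep and which quantile levels define the IQR are design choices that this summary does not specify.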

Paper Structure

This paper contains 10 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Using inter-quantile ranges (IQRs) to evaluate prediction quality. (a) Distribution of mean absolute error (MAE) categorized by IQR on a randomly chosen held-out test set. Higher IQR is associated with higher error. (b) Scatter plot showing the negative correlation between a model's performance (correlation coefficient) and its mean prediction IQR on the Target1 (A) dataset. The dotted line shows the best linear fit ($r=-0.42$).
  • Figure 2: Schema of the siRNA and mRNA taken as input to the TabPFN model.
  • Figure 3: Relationship between empirical coverage and the expected quantile range provided to the model. The close adherence to the identity line ($y=x$) demonstrates that the model is well-calibrated.
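
The calibration check summarized in Figure 3 compares each nominal interval level with the fraction of true values that actually fall inside the predicted interval. The sketch below illustrates that computation under loud assumptions: a Gaussian predictive distribution with a fixed standard deviation stands in for the model's predicted quantiles, the data are synthetic, and the helper names are hypothetical. It shows only how empirical coverage is measured, not how TabPFN produces its quantiles.

```python
import random
from statistics import NormalDist


def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    hits = sum(l <= y <= u for y, l, u in zip(y_true, lower, upper))
    return hits / len(y_true)


def central_interval(mean, std, level):
    """Central interval of a Gaussian predictive distribution at a nominal level."""
    dist = NormalDist(mean, std)
    return dist.inv_cdf((1 - level) / 2), dist.inv_cdf((1 + level) / 2)


# Synthetic data: the "model" knows the true noise level, so it should be
# well-calibrated by construction (an assumption for this illustration).
rng = random.Random(0)
means = [rng.uniform(0.0, 1.0) for _ in range(2000)]
y_true = [rng.gauss(m, 0.1) for m in means]

coverage = {}
for level in (0.5, 0.8, 0.9):
    bounds = [central_interval(m, 0.1, level) for m in means]
    lower = [b[0] for b in bounds]
    upper = [b[1] for b in bounds]
    coverage[level] = empirical_coverage(y_true, lower, upper)

for level, observed in coverage.items():
    print(f"nominal {level:.2f} -> empirical {observed:.3f}")
```

For a well-calibrated model, plotting the empirical values against the nominal levels gives points close to the identity line $y=x$, which is the pattern the figure caption describes.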