Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction
Jie Li, Andrew McCarthy, Zhizhuo Zhang, Stephen Young
TL;DR
This paper addresses biomolecule efficacy prediction from small, heterogeneous datasets using TabPFN-based in-context learning, and introduces OligoICP, an uncertainty-guided, label-free strategy for post-hoc model selection. With a 574-feature representation that combines sequence encodings, trimer counts, and thermodynamic features, TabPFN outperforms a state-of-the-art model on the Huesken siRNA dataset and generalizes to novel targets in few-shot settings. A key finding is that the model's inter-quantile range ($IQR$) is well calibrated and negatively correlates with actual error, enabling model selection without ground-truth labels; the OligoICP ensemble, which selects the models with the lowest mean $IQR$, further improves correlation across targets. The approach offers a practical, scalable path to more reliable efficacy predictions when ground truth is scarce or unavailable. The results show improved correlation on multiple targets and indicate that uncertainty-based ensemble selection moves performance toward that of an oracle selector, with potential applicability to other biomolecule prediction tasks.
Abstract
In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, which makes post-hoc ensembling of models trained on different data subsets a viable approach. An open question is how to select the best models for the ensemble without access to ground-truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using straightforward sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model's predicted inter-quantile range (IQR), a measure of its uncertainty, has a negative correlation with true prediction error. We develop the OligoICP method, which selects and averages an ensemble of models with the lowest mean IQR for siRNA efficacy prediction, and which outperforms both naive ensembling and a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.
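
The selection rule described above (rank candidate models by mean IQR, keep the lowest, average their predictions) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary keys, the choice of quantile levels, and the ensemble size `k` are all assumptions made for illustration, and the per-model predictions are toy numbers rather than TabPFN outputs.

```python
from statistics import mean


def mean_iqr(q_low, q_high):
    """Average width between the upper and lower predicted quantiles
    over all test samples (the quantile levels are an assumption)."""
    return mean(hi - lo for lo, hi in zip(q_low, q_high))


def select_and_average(models, k):
    """Keep the k models with the lowest mean IQR and average their
    point predictions. No ground-truth labels are used.

    `models` is a list of dicts with hypothetical keys:
      "pred"   - point predictions, one per test sample
      "q_low"  - lower-quantile predictions per test sample
      "q_high" - upper-quantile predictions per test sample
    """
    ranked = sorted(models, key=lambda m: mean_iqr(m["q_low"], m["q_high"]))
    chosen = ranked[:k]
    n_samples = len(chosen[0]["pred"])
    return [mean(m["pred"][i] for m in chosen) for i in range(n_samples)]


# Toy candidates, e.g. models conditioned on different context subsets.
models = [
    {"pred": [0.2, 0.8], "q_low": [0.1, 0.7], "q_high": [0.3, 0.9]},  # narrow intervals
    {"pred": [0.4, 0.6], "q_low": [0.0, 0.2], "q_high": [0.8, 1.0]},  # wide intervals
    {"pred": [0.3, 0.7], "q_low": [0.2, 0.5], "q_high": [0.5, 0.8]},  # fairly narrow
]

# The wide-interval model is dropped; the other two are averaged.
ensemble = select_and_average(models, k=2)
```

In this sketch the ranking depends only on the predicted quantiles for the unlabeled test samples, which is what makes the selection label-free; how `k` should be chosen is left open here.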
