Predictability Analysis of Regression Problems via Conditional Entropy Estimations
Yu-Hsueh Fang, Chia-Yen Lee
TL;DR
The paper tackles the limitation of traditional regression metrics by framing predictability through the conditional entropy $H(Y|X)$. It develops two density-based estimators, KNIFE-P (an over-estimator) and LMC-P (an under-estimator), augmented with normalization and perturbation to reliably bound $H(Y|X)$ and thus delineate achievable performance in regression. The methodology is extended to bound the coefficient of determination $R^2$, providing practical upper and lower predictability limits under Gaussian-noise assumptions, and is demonstrated on both synthetic and real datasets. Empirical results show that KNIFE-P and LMC-P effectively capture the admissible region for model performance, offering a tractable framework for feature-contribution analysis and benchmark-guided model development. Limitations include curse-of-dimensionality effects in high-complexity tasks; future work aims at more efficient estimators and at handling long-tail data distributions.
Abstract
In the field of machine learning, regression problems are pivotal due to their ability to predict continuous outcomes. Traditional error metrics like mean squared error, mean absolute error, and the coefficient of determination measure model accuracy. Model accuracy, however, is a consequence of both the selected model and the features, which blurs the analysis of each one's contribution. Predictability, on the other hand, focuses on how predictable a target variable is given a set of features. This study introduces conditional entropy estimators to assess predictability in regression problems, bridging this gap. We enhance and develop reliable conditional entropy estimators, particularly the KNIFE-P estimator and the LMC-P estimator, which offer over- and under-estimation, respectively, providing a practical framework for predictability analysis. Extensive experiments on synthesized and real-world datasets demonstrate the robustness and utility of these estimators. Additionally, we extend the analysis to the coefficient of determination $R^2$, enhancing the interpretability of predictability. The results highlight the effectiveness of KNIFE-P and LMC-P in capturing the achievable performance and limitations of feature sets, providing valuable tools for the development of regression models. These indicators offer a robust framework for assessing the predictability of regression problems.
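To make the link between conditional entropy and $R^2$ concrete, the following is a minimal sketch under a stated assumption, not the paper's implementation: the target is modeled as $Y = f(X) + \varepsilon$ with homoscedastic Gaussian noise, so that $H(Y|X) = \tfrac{1}{2}\ln(2\pi e\,\sigma^2)$ in nats. The function names `r2_from_conditional_entropy` and `r2_interval` are hypothetical, and the entropy values passed in stand in for the outputs of an over-estimator and an under-estimator of $H(Y|X)$.

```python
import math


def r2_from_conditional_entropy(h_cond_nats: float, var_y: float) -> float:
    """R^2 implied by H(Y|X) (in nats), ASSUMING homoscedastic Gaussian noise.

    Hypothetical helper for illustration only.
    """
    # Gaussian differential entropy: H = 0.5 * ln(2*pi*e*sigma^2),
    # hence sigma^2 = exp(2H) / (2*pi*e).
    noise_var = math.exp(2.0 * h_cond_nats) / (2.0 * math.pi * math.e)
    # R^2 = 1 - (irreducible noise variance) / (total variance of Y).
    return 1.0 - noise_var / var_y


def r2_interval(h_over: float, h_under: float, var_y: float) -> tuple[float, float]:
    """Map an over- and an under-estimate of H(Y|X) to a (lower, upper) R^2 range.

    A larger entropy implies more noise, so the over-estimate yields the
    lower R^2 limit and the under-estimate yields the upper one.
    """
    return (
        r2_from_conditional_entropy(h_over, var_y),
        r2_from_conditional_entropy(h_under, var_y),
    )


if __name__ == "__main__":
    # Illustrative numbers only: noise variance 0.25, Var(Y) = 1.0.
    true_h = 0.5 * math.log(2.0 * math.pi * math.e * 0.25)
    lower, upper = r2_interval(true_h + 0.1, true_h - 0.1, var_y=1.0)
    print(f"R^2 range: [{lower:.3f}, {upper:.3f}]")
```

Under this assumption, the gap between the two $R^2$ values reflects only the gap between the two entropy estimates; if the noise is not Gaussian, the conversion is an approximation rather than an exact identity.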
