From Predictive Importance to Causality: Which Machine Learning Model Reflects Reality?
Muhammad Arbab Arshad, Pallavi Kandanur, Saurabh Sonawani, Laiba Batool, Muhammad Umar Habib
TL;DR
The paper asks whether the features a model finds most predictive are also the ones that causally drive house prices, using CatBoost and LightGBM on the Ames Housing Dataset. It combines SHAP-based feature importance with EconML causal inference, including average treatment effect (ATE) estimation, heterogeneity analyses, policy-tree interpretation, and what-if simulations. CatBoost achieves higher test accuracy (89.91), and its SHAP importance aligns more closely with causally significant features ($\rho=0.48$) than LightGBM's does ($\rho=0.35$). Porch-related features exhibit context-dependent causal effects, with what-if analyses showing price gains (e.g., from $146,936.89$ to $149,649.98$). These results underscore the value of integrated predictive-causal frameworks for real estate valuation and policy guidance.
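To put the what-if example in perspective, the short sketch below computes the absolute and relative gain implied by the two prices quoted above. The function name `what_if_gain` is an illustrative assumption, not something from the paper, and the sketch does not reproduce how the paper's simulation arrived at those prices.

```python
def what_if_gain(baseline_price, what_if_price):
    """Return (absolute gain, percentage gain) between two prices.

    Hypothetical helper for illustration only; it performs plain
    arithmetic on the two figures quoted in the summary.
    """
    gain = what_if_price - baseline_price
    pct = 100.0 * gain / baseline_price
    return gain, pct


# The two prices reported in the what-if example.
gain, pct = what_if_gain(146_936.89, 149_649.98)
print(f"gain = {gain:.2f}, relative gain = {pct:.2f}%")
```

On these figures the gain is about 2,713 in absolute terms, or a little under 2% of the baseline price.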
Abstract
This study analyzes the Ames Housing Dataset using CatBoost and LightGBM models to explore feature importance and causal relationships in housing price prediction. Both models achieve high accuracy in price forecasting, and we examine how their SHAP values correlate with EconML estimates. Our analysis reveals a moderate Spearman rank correlation of 0.48 between SHAP-based feature importance and causally significant features, showing that the features a model relies on for prediction only partly coincide with those that causally affect price. Through causal analysis that includes heterogeneity exploration and policy tree interpretation, we show how specific features such as porches affect housing prices across different scenarios. This work underscores the need for integrated approaches that combine predictive power with causal insights in real estate valuation, and offers guidance for industry stakeholders.
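The alignment measure mentioned above, a Spearman rank correlation between two per-feature score lists, can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the two score lists are made-up values for five hypothetical features, and the helper names (`average_ranks`, `pearson`, `spearman_rho`) are assumptions introduced here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) to n; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five hypothetical features (NOT values from the paper).
shap_importance = [0.40, 0.25, 0.15, 0.12, 0.08]  # e.g. mean |SHAP| per feature
causal_effect = [0.30, 0.05, 0.20, 0.02, 0.10]    # e.g. |estimated effect| per feature

rho = spearman_rho(shap_importance, causal_effect)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks enter the calculation, the measure is insensitive to the different scales of SHAP values and causal effect estimates; a value near 1 would mean the two rankings largely agree, while a moderate value such as the reported 0.48 indicates partial agreement.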
