Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation
Alexey S. Tanashkin, Irina G. Tanashkina, Alexander S. Maksimchuik
TL;DR
The paper tackles the problem of building interpretable machine learning models for mass cadastral valuation in Primorsky Krai, Russia, focusing on land parcels and flats. It proposes a data-rich workflow that includes robust outlier handling, spatial feature engineering (notably a road-network centrality measure computed from OpenStreetMap data), and feature selection to manage multicollinearity. For land parcels, an interpretable regression-kriging framework (OLS plus kriging of the residuals) is shown to outperform a plain linear model. For flats, the RuleFit method combines decision-rule generation with sparse linear modeling, which keeps the model interpretable while achieving predictive accuracy that is sometimes comparable to Random Forest. The results demonstrate that hybrid, interpretable approaches can be competitive in real estate valuation and offer practical benefits for policy and legal applications, with potential to generalize to other regions pending further data and methodological refinement. Reported metrics include $R^2_{adj}=0.760$ and $MAPE=19.23\%$ for land parcels (regression-kriging) and $R^2_{adj}=0.613$ and $MAPE=8.8\%$ for flats (RuleFit).
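For readers who want to reproduce the two headline metrics, the sketch below computes $R^2_{adj}$ and $MAPE$ under their standard textbook definitions. This is an assumption: the summary above does not state the exact formulas the authors used, and the function names `adjusted_r2` and `mape` are illustrative only.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 (standard definition, assumed here): penalizes R^2 by the number of predictors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires non-zero true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```

Because MAPE divides by the observed value, it is only meaningful for strictly positive targets such as prices.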
Abstract
In this paper, we review modern approaches to building interpretable models of property markets using machine learning, based on the mass valuation of property in the Primorye region, Russia. Anyone trying to build a good model is likely to encounter numerous difficulties. Their main source is the large gap between noisy real market data and the idealized data usually used in machine learning tutorials. This paper covers all stages of modeling: collection of initial data, identification of outliers, search for and analysis of patterns in the data, formation and final choice of price factors, building of the model, and evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming them, illustrated with real examples. We show that combining classical linear regression with kriging (a geostatistical interpolation method) makes it possible to build an effective model for land parcels. For flats, where many objects are attributed to a single spatial point, applying geostatistical methods becomes problematic. Instead, we suggest linear regression with automatic generation and selection of additional rules based on decision trees, the so-called RuleFit method. We compare the performance of our inherently interpretable models with the well-proven "black-box" Random Forest method and demonstrate similar results. Thus we show that, despite a restriction as strong as the requirement of interpretability, which is important in practice (for example, in legal matters), it is still possible to build effective models of real property markets.
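The combination of linear regression with kriging described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes simple kriging of the OLS residuals with an exponential covariance whose range (`corr_range`) and nugget are supplied by hand rather than fitted to an empirical variogram. All function and parameter names are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def exp_covariance(d, sill, corr_range):
    """Exponential covariance model (an assumed choice, not taken from the paper)."""
    return sill * np.exp(-d / corr_range)

def fit_regression_kriging(X, coords, y, corr_range=1.0, nugget=1e-6):
    # Step 1: ordinary least squares trend on the price factors.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Step 2: simple kriging of the residuals (zero mean by construction of OLS).
    sill = resid.var()
    K = exp_covariance(pairwise_dist(coords, coords), sill, corr_range)
    K += nugget * np.eye(len(X))  # small nugget for numerical stability
    weights = np.linalg.solve(K, resid)
    return {"beta": beta, "coords": coords, "weights": weights,
            "sill": sill, "corr_range": corr_range}

def predict_regression_kriging(model, X_new, coords_new):
    # Prediction = linear trend + kriged residual at the new locations.
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    trend = A_new @ model["beta"]
    k = exp_covariance(pairwise_dist(coords_new, model["coords"]),
                       model["sill"], model["corr_range"])
    return trend + k @ model["weights"]
```

The linear part stays fully interpretable through its coefficients, while the kriged residual acts as a spatial correction. The sketch also shows why flats are harder: many objects sharing one coordinate pair produce duplicate rows in the covariance matrix, which makes it nearly singular.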
