Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation

Alexey S. Tanashkin, Irina G. Tanashkina, Alexander S. Maksimchuik

TL;DR

The paper tackles the problem of building interpretable machine learning models for mass cadastral valuation in Primorsky Krai, Russia, focusing on land parcels and flats. It proposes a data-rich workflow that includes robust outlier handling, spatial feature engineering (notably a road-network centrality measure computed from OpenStreetMap data), and feature selection to manage multicollinearity. For land parcels, an interpretable regression-kriging framework (OLS plus kriging of the residuals) is shown to outperform a plain linear model. For flats, the RuleFit method combines decision-rule generation with sparse linear modeling to maintain interpretability and achieve strong predictive accuracy, sometimes comparable to Random Forest. The results demonstrate that hybrid, interpretable approaches can achieve competitive performance in real estate valuation and offer practical benefits for policy and legal applications, with potential to generalize to other regions pending further data and methodological refinement. Reported metrics are the adjusted coefficient of determination $R^2_{adj}$ and the mean absolute percentage error $MAPE$: $R^2_{adj}=0.760$ and $MAPE=19.23\%$ for land parcels (regression-kriging), and $R^2_{adj}=0.613$ and $MAPE=8.8\%$ for flats (RuleFit).
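
The two metrics quoted above can be computed as in the following minimal sketch. It uses the standard textbook definitions, which are an assumption here: the paper's exact formulas are not shown in this summary, and the function names and the `n_features` parameter are hypothetical.

```python
def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 (textbook definition; assumed, not taken from the paper)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Penalize the plain R^2 for the number of predictors used.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires non-zero y_true)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative numbers only, not data from the paper.
prices = [100.0, 200.0, 300.0, 400.0]
estimates = [110.0, 190.0, 310.0, 390.0]
print(adjusted_r2(prices, estimates, n_features=1))
print(mape(prices, estimates))
```

Because MAPE is relative to the true price, it is often easier to read for valuation work than an absolute error, while $R^2_{adj}$ discounts the fit for each additional price factor.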

Abstract

In this paper, we review modern approaches to building interpretable models of property markets using machine learning on the basis of the mass valuation of property in the Primorye region, Russia. There are numerous potential difficulties one could encounter in the effort to build a good model. Their main source is the huge difference between noisy real market data and the ideal data usually used in tutorials on machine learning. This paper covers all stages of modeling: collection of initial data, identification of outliers, search for and analysis of patterns in the data, formation and final choice of price factors, building of the model, and evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming emerging difficulties using real examples. We show that the combination of classical linear regression with kriging (an interpolation method from geostatistics) makes it possible to build an effective model for land parcels. For flats, when many objects are attributed to one spatial point, the application of geostatistical methods becomes problematic. Instead, we suggest linear regression with automatic generation and selection of additional rules on the basis of decision trees, the so-called RuleFit method. We compare the performance of our inherently interpretable models with the well-proven "black-box" Random Forest method and demonstrate similar results. Thus we show that, despite such a strong restriction as the requirement of interpretability, which is important in practice (for example, in legal matters), it is still possible to build effective models of real property markets.
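
The combination of linear regression with kriging described in the abstract can be illustrated with a minimal NumPy sketch: fit an OLS trend on the price factors, then interpolate the OLS residuals with ordinary kriging and add them back. Everything specific here is an assumption for illustration: the exponential variogram, its fixed `nugget`, `sill`, and `range_` parameters (in practice these would be fitted to an empirical variogram), and the function names. It is not the authors' implementation.

```python
import numpy as np


def exp_variogram(h, nugget, sill, range_):
    # Exponential semivariogram; gamma(0) = 0 by convention.
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / range_))
    return np.where(h > 0, gamma, 0.0)


def regression_kriging(X, coords, y, X_new, coords_new,
                       nugget=0.0, sill=1.0, range_=1.0):
    n = len(y)
    # Step 1: OLS trend on the price factors (with intercept).
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta

    # Step 2: ordinary kriging of the residuals.
    # System: [[Gamma, 1], [1^T, 0]] [w; mu] = [gamma_0; 1]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = exp_variogram(d, nugget, sill, range_)
    K[n, n] = 0.0
    d0 = np.linalg.norm(coords[:, None, :] - coords_new[None, :, :], axis=2)
    rhs = np.vstack([exp_variogram(d0, nugget, sill, range_),
                     np.ones((1, d0.shape[1]))])
    weights = np.linalg.solve(K, rhs)[:n]  # drop the Lagrange multiplier row
    resid_new = weights.T @ resid

    # Step 3: prediction = linear trend + kriged residual.
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    return A_new @ beta + resid_new


# Synthetic data for illustration only.
gen = np.random.default_rng(0)
coords = gen.random((30, 2))   # parcel locations
X = gen.random((30, 2))        # two hypothetical price factors
y = 1.0 + X @ np.array([2.0, -1.0]) + np.sin(3.0 * coords[:, 0])
print(regression_kriging(X, coords, y, X[:3], coords[:3] + 0.01))
```

The trend coefficients `beta` remain directly readable, which is what keeps the model interpretable; kriging only corrects for spatially correlated residuals. In this sketch, two objects with identical coordinates produce identical rows in the kriging matrix, so the system becomes singular, which is one way to see why the approach is problematic for flats that share a single spatial point.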

Paper Structure

This paper contains 7 sections, 10 equations, 21 figures, 11 tables.

Figures (21)

  • Figure 1: Primorsky Krai (highlighted in red) on the map of the Asia-Pacific region. Countries geographically located in this region are highlighted in purple. Map lines do not necessarily depict accepted national boundaries.
  • Figure 2: Fluctuation of the market per square meter price (in RUB) for the segment “Land parcels”. Significantly different prices of adjacent land parcels are marked by red ovals to guide the eye. The values in filled circles show the number of objects for a particular area. The triangular symbols represent hills and are added to the map automatically as part of the standard OpenStreetMap (OSM) layout.
  • Figure 3: Comparison of median values of the per square meter price (PSMP) for the city and the suburbs (segment “Land parcels”). The values in filled circles show the number of objects for a particular area. The inset shows the estimated probability density functions (PDF) of the PSMP. The triangular symbols represent hills and are added to the map automatically as part of the standard OpenStreetMap layout.
  • Figure 4: Comparison of probability density function for the offers and the deals (segment “Flats”, the whole dataset). The suspicious range of prices is shown by a dotted red circle.
  • Figure 5: Results of the RANSAC regression for one of the subclusters of similar buildings formed by DBSCAN (segment “Flats”).
  • ...and 16 more figures