A Multi-Modal Deep Learning Based Approach for House Price Prediction
Md Hasebul Hasan, Md Abid Jahan, Mohammed Eunus Ali, Yuan-Fang Li, Timos Sellis
TL;DR
The paper tackles house price prediction by integrating the diverse data modalities found in real estate listings. It introduces the Multi-Modal House Price Predictor (MHPP), which learns a joint embedding from geo-spatial context (GSNE), textual descriptions (SBERT), and house images (CLIP) alongside raw features, then feeds that embedding to a downstream regressor to predict prices. On a Melbourne dataset, adding text and image embeddings to the geo-spatial and raw features reduces both MAE and RMSE across several regression models. The work demonstrates the practical value of multi-modal representations for real estate analytics, and the code and dataset are publicly available for reproducibility.
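For reference, the two error metrics named above have standard definitions, sketched below. The prices are invented purely for illustration and are not results from the paper; the function names are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up sale prices and predictions (assumption, not data from the paper).
actual = [500_000.0, 750_000.0]
predicted = [530_000.0, 710_000.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is why papers often report both.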
Abstract
Accurate prediction of house prices, a vital aspect of the residential real estate sector, is of substantial interest to a wide range of stakeholders. However, predicting house prices is a complex task because prices vary significantly with factors such as house features, location, neighborhood, and many others. Despite numerous attempts to predict house prices accurately using a wide array of algorithms, including recent deep learning techniques, existing approaches fall short of considering a wide range of factors, such as textual and visual features. This paper addresses this gap by comprehensively incorporating the attributes typically showcased in real estate listings, such as house features, textual descriptions, geo-spatial neighborhood, and house images, into a house price prediction system. Specifically, we propose a multi-modal deep learning approach that leverages different types of data to learn a more accurate representation of the house. In particular, we learn a joint embedding of the raw house attributes, the geo-spatial neighborhood, and, most importantly, the textual description and images representing the house, and then use a downstream regression model to predict the house price from this jointly learned embedding vector. Our experimental results on a real-world dataset show that adding the text embedding of the house advertisement description and the image embedding of the house pictures to the raw attributes and geo-spatial embedding significantly improves house price prediction accuracy. The relevant source code and dataset are publicly accessible at the following URL: https://github.com/4P0N/mhpp
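The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the joint representation is formed by simple concatenation of per-modality vectors (the abstract does not specify the fusion step), uses tiny hand-written vectors in place of real geo-spatial, text, and image embeddings, and substitutes a k-nearest-neighbour regressor for the downstream model. All function names and values are hypothetical.

```python
import math

def fuse(raw, geo, text, image):
    """Concatenate per-modality vectors into one joint feature vector."""
    return list(raw) + list(geo) + list(text) + list(image)

def knn_predict(train_x, train_y, query, k=2):
    """Predict a price as the mean price of the k nearest training houses."""
    ranked = sorted(range(len(train_x)),
                    key=lambda idx: math.dist(train_x[idx], query))
    nearest = ranked[:k]
    return sum(train_y[idx] for idx in nearest) / len(nearest)

# Toy listings (assumed values):
# (raw attributes, geo embedding, text embedding, image embedding, price)
listings = [
    ([3.0, 1.0], [0.1, 0.2], [0.5, 0.1], [0.3, 0.3], 800_000.0),
    ([3.0, 2.0], [0.1, 0.3], [0.4, 0.1], [0.3, 0.2], 900_000.0),
    ([5.0, 3.0], [0.9, 0.8], [0.9, 0.7], [0.8, 0.9], 1_500_000.0),
]
train_x = [fuse(row[0], row[1], row[2], row[3]) for row in listings]
train_y = [row[4] for row in listings]

# A new listing, fused the same way, is priced from its nearest neighbours.
query = fuse([3.0, 1.0], [0.1, 0.2], [0.5, 0.2], [0.3, 0.3])
predicted_price = knn_predict(train_x, train_y, query, k=2)
```

In a real setting each modality would be produced by its own encoder and the regressor would be trained on many listings; the point here is only that every modality contributes dimensions to a single vector that the regressor consumes.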
