Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.3 Applied ML 1

Extracting meaningful information from web data on real estate – challenges and experiences

Dominik Dabrowski* 1, Bartosz Grancow* 1, Klaudia Peszat 1


This paper presents the challenges of applying Natural Language Processing (NLP) to web data from online real estate offers in order to extract information that could complement official statistics. The study is part of the experimental stream of the ESSnet Web Intelligence Network project, which aims to explore the possibility of producing new statistics and augmenting existing ones via the European platform, the Web Intelligence Hub (WIH).

The information acquired from web data may be used to monitor trends in the real estate market in a more timely manner and to provide new indicators. Online real estate sales and rental offers contain a wide range of additional information, e.g. on the characteristics of the building, the area surrounding the property, and the amenities available in the property, which can be used in a variety of statistical domains. However, extracting valuable information from these highly unstructured data and providing adequate input to machine learning classification models is a considerable challenge.

This paper explores the application of NLP techniques to the automatic classification of online real estate data according to selected variables. It looks into methodological questions, issues related to manual classification, and the building of automatic classifiers. Finally, the paper discusses the possibility of applying these solutions in official statistics in order to provide input for one of the representative items in price statistics.

*: Speaker

1: Statistics Poland