Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.3 Applied ML 1

Machine learning and wealth measurement: an experiment on housing wealth of French households

Olivier Meslin* 1, Mathias André1

Abstract

Accurately measuring wealth, wealth inequality and its evolution is of considerable importance for researchers, policymakers and the general public, particularly in a context of rising inequality. Data on wealth, however, is patchy and sometimes unreliable, particularly at the top of the distribution. This project aims to improve on available data sources by introducing a new database on French households' housing wealth, based on administrative data.

This proposal focuses on one part of the project: building on the growing literature on the potential uses of machine learning techniques by statisticians and economists (Varian 2014, Mullainathan et al. 2017, Athey et al. 2019), we demonstrate how machine learning algorithms can be used to predict the market value of all privately held dwellings on a country-wide scale.

This project shows that machine learning methods bring significant improvements to the statistician's toolbox, but also that using ML algorithms in official statistics is not a straightforward application of existing algorithms. Instead, it requires a careful adaptation of these algorithms to use cases they were not initially designed for. We intend to explain how we used ML algorithms to overcome three challenges:

  • Dense areas are typically characterized by large price variations with respect to location, whereas the impact of location is much smoother in rural areas. In other words, the scale of the relevant housing market varies considerably over the country. How can a model accurately account for this heterogeneity of market scales?
  • Areas with frequent real estate transactions (mostly large cities) are overrepresented in the training data, whereas areas with significant numbers of dwellings but infrequent transactions (mostly rural areas) are underrepresented. How can we be sure that a model will perform reasonably well on all areas, even underrepresented ones?
  • Algorithms may perform poorly on luxury properties, as these properties account for a very small share of training data. However, reliable market value estimates of these properties matter greatly for the estimation of housing wealth. How can we be sure that the model performs well on luxury dwellings?
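
To make the second and third challenges concrete, here is a minimal Python sketch of two generic diagnostics. It illustrates the problem and is not the method used in this project. It assumes that each observed sale carries an area label and that the housing stock of each area is known. The function names `stock_weights` and `error_by_group`, the area labels and all figures are hypothetical. The first function reweights sales so that each area counts in proportion to its housing stock rather than its number of transactions. The second reports prediction error separately for each group, so that weak performance on a small group (rural areas, luxury dwellings) is not hidden by a good overall average.

```python
from collections import Counter, defaultdict
from statistics import median


def stock_weights(sale_areas, stock_by_area):
    """Weight each sale by (area's share of stock) / (area's share of sales).

    Sales from areas that are overrepresented among transactions get a
    weight below 1; sales from underrepresented areas get a weight above 1.
    """
    n_sales = len(sale_areas)
    sales_count = Counter(sale_areas)
    total_stock = sum(stock_by_area[area] for area in sales_count)
    return [
        (stock_by_area[area] / total_stock) / (sales_count[area] / n_sales)
        for area in sale_areas
    ]


def error_by_group(y_true, y_pred, groups):
    """Median absolute percentage error of predictions within each group."""
    errors = defaultdict(list)
    for actual, predicted, group in zip(y_true, y_pred, groups):
        errors[group].append(abs(predicted - actual) / actual)
    return {group: median(values) for group, values in errors.items()}


# Toy data (made up): 8 urban sales and 2 rural sales, although both
# area types hold the same number of dwellings.
areas = ["urban"] * 8 + ["rural"] * 2
weights = stock_weights(areas, {"urban": 500, "rural": 500})

# Toy prices and predictions (made up), grouped by market segment.
prices = [100, 200, 1000]
predictions = [110, 180, 500]
segments = ["standard", "standard", "luxury"]
errors = error_by_group(prices, predictions, segments)
```

Weights of this kind could be passed to any learner that accepts per-observation weights, and the per-group errors could be tracked alongside the overall error when comparing models.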

*: Speaker

1: Insee - France