Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 4.3 Methodology II

Machine learning for model-assisted estimation in survey sampling: bridging the rigor of statistical inference with the power of machine learning

Boriska Toth* 1

Abstract

The use of machine learning in statistics production is being explored widely, with applications including coding, outlier detection, and imputation of missing values. Relatively little work has so far focused on one of the most central application areas for national statistical institutes (NSIs): replacing the simple, typically linear, models widely used for model-assisted estimation in survey sampling with machine learning methods. These assisting models are trained on the survey sample and adjust for available covariates when estimating a mean outcome for out-of-survey units. While linear assisting models can provide consistent estimates of population means (or totals), machine learning-based models that fit the data better can be substantially more efficient. This efficiency gain is especially valuable because it can lower standard errors enough for a statistic to be published broken down into many strata, which significantly enhances the value of the published statistic.
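As an illustration of the model-assisted idea described above, the following is a minimal sketch of a difference estimator under simple random sampling. It is not the speaker's implementation: the synthetic population, the helper names `fit_linear` and `model_assisted_mean`, and the equal inclusion probability `n / N` are all assumptions made for the example. The `predict` argument can be any fitted model, so a machine learning predictor could be swapped in for the linear one.

```python
import random
from statistics import mean


def fit_linear(xs, ys):
    """Closed-form simple linear regression; a stand-in for any assisting model."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def model_assisted_mean(pop_x, sample_idx, sample_y, incl_prob, predict):
    """Difference estimator of the population mean.

    Sums model predictions over every population unit, then adds the
    design-weighted residuals observed in the sample as a correction.
    """
    pred_total = sum(predict(x) for x in pop_x)
    resid_total = sum(
        (y - predict(pop_x[i])) / incl_prob for i, y in zip(sample_idx, sample_y)
    )
    return (pred_total + resid_total) / len(pop_x)


# Hypothetical population: the covariate x is known for every unit,
# the outcome y is observed only for sampled units.
rng = random.Random(0)
N, n = 2000, 100
pop_x = [rng.uniform(0, 10) for _ in range(N)]
pop_y = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in pop_x]

sample_idx = rng.sample(range(N), n)  # simple random sample without replacement
sample_y = [pop_y[i] for i in sample_idx]
incl_prob = n / N

predict = fit_linear([pop_x[i] for i in sample_idx], sample_y)
ma_est = model_assisted_mean(pop_x, sample_idx, sample_y, incl_prob, predict)
ht_est = sum(y / incl_prob for y in sample_y) / N  # Horvitz-Thompson, no model
true_mean = mean(pop_y)

print(f"true mean       : {true_mean:.3f}")
print(f"Horvitz-Thompson: {ht_est:.3f}")
print(f"model-assisted  : {ma_est:.3f}")
```

With a predictor that is identically zero the estimator reduces to the Horvitz-Thompson mean, and the better the predictor fits, the smaller the residuals that drive the sampling variance. Note that here the model is fitted on the same sample used for the residual correction, so the sketch illustrates the general form only and does not address the training and test sampling questions raised in the abstract.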

Adopting machine learning in official statistics production requires theory for how (and under what conditions) rigorous statistical inference can be done using predictions from machine learning models. I present several existing innovative approaches that address this. These include Sande and Zhang's work (1) on how properly sampled training and test sets for machine learning guarantee consistent estimation; the Super Learner approach (2), in which choosing the best linear combination of algorithms from an arbitrarily large library guarantees optimal performance; and the targeted learning methodology (3), which constructs consistent and optimally efficient estimators while also giving a formula for the variance. Finally, I will present a basic application of these methods to a real dataset at Statistics Norway on estimating job vacancies.
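The idea of combining algorithms from a library can be sketched in a few lines. The example below is a simplified illustration, assuming a library of just two hypothetical learners (`fit_mean` and `fit_linear`), synthetic data, and a grid search over a single convex weight; a full Super Learner handles many learners and typically solves for the weights by constrained regression on the cross-validated predictions.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Learner 1: ignores x and predicts the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Learner 2: closed-form simple linear regression."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def cv_predictions(xs, ys, learners, k=5):
    """Out-of-fold predictions: each unit is predicted by models that never saw it."""
    n = len(xs)
    preds = [[0.0] * n for _ in learners]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for j, fit in enumerate(learners):
            model = fit(train_x, train_y)
            for i in test:
                preds[j][i] = model(xs[i])
    return preds


def best_convex_weight(p0, p1, ys, steps=100):
    """Grid search for alpha in [0, 1] minimising the cross-validated MSE
    of the combination alpha * p0 + (1 - alpha) * p1."""
    def mse(a):
        return mean((a * u + (1 - a) * v - y) ** 2 for u, v, y in zip(p0, p1, ys))
    return min((a / steps for a in range(steps + 1)), key=mse)


# Hypothetical data with a strong linear signal.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

p_mean, p_lin = cv_predictions(xs, ys, [fit_mean, fit_linear])
alpha = best_convex_weight(p_mean, p_lin, ys)

# Refit both learners on all the data and combine them with the chosen weight.
m_mean, m_lin = fit_mean(xs, ys), fit_linear(xs, ys)
ensemble = lambda x: alpha * m_mean(x) + (1 - alpha) * m_lin(x)

print(f"weight on mean-only learner: {alpha:.2f}")
print(f"ensemble prediction at x=5 : {ensemble(5.0):.3f}")
```

Selecting the weight on out-of-fold predictions rather than in-sample fits is the key design choice: it keeps a flexible learner from being rewarded for overfitting the data it was trained on.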

References:

  1. Design-unbiased statistical learning in survey sampling. arXiv:2003.11423v1. LS Sande and L-C Zhang, 2020.
  2. Super learner. Statistical Applications in Genetics and Molecular Biology, 6, article 25. MJ van der Laan, EC Polley, and AE Hubbard, 2007.
  3. Targeted learning: causal inference for observational and experimental data. Springer Series in Statistics. MJ van der Laan and S Rose, 2011.

*: Speaker

1: Statistics Norway