Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April 2024

Plenary 5

On Three Methodological Challenges – Machine Learning under Complex Sample Designs, Complex Evaluation Structures and Complex Uncertainty

Thomas Augustin* 1, Malte Nalenz1, Julian Rodemann1, Mirja Hodel1, Christoph Jansen1, Georg Schollmeyer1


The talk gives an overview of current research on three challenges in the foundations of machine learning, stressing their practical relevance.

The first part is concerned with complex samples with unequal selection/inclusion probabilities, typical of survey data. While most theoretical and applied work on machine learning assumes that the data are i.i.d., we demonstrate that carefully taking the sampling design into account may be crucial. Concretely, we consider two exemplary situations. First, we provide Horvitz-Thompson and Hájek-type estimators for the generalization error of different learners and confirm by simulation that they are clearly superior to the standard procedure that ignores the design. Second, we study tree-based methods under complex samples. We trace the recursive construction of regression trees back to local MSE/variance minimization, for which we characterize the arising bias. We then propose a Hájek-type variance estimator that substantially reduces the bias in the resulting trees, both in the predictions and in the tree structure. By simulation, we show that such a correction also proves powerful for random forests, and we illustrate our findings with housing data from Seoul.
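The design-weighting idea behind such estimators can be sketched in a few lines. This is a minimal illustration only, not the estimators from the paper: the function names, the squared-error loss, and the interface are our assumptions; the essential point is that each sampled unit's loss is inversely weighted by its inclusion probability.

```python
import numpy as np

def ht_mse(y_true, y_pred, incl_prob, pop_size):
    """Horvitz-Thompson-type estimate of the population MSE:
    weight each unit's loss by the inverse of its inclusion
    probability, then divide by the known population size N."""
    loss = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    weights = 1.0 / np.asarray(incl_prob)
    return np.sum(weights * loss) / pop_size

def hajek_mse(y_true, y_pred, incl_prob):
    """Hajek-type estimate: normalize by the sum of the weights
    instead of N, so the population size need not be known."""
    loss = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    weights = 1.0 / np.asarray(incl_prob)
    return np.sum(weights * loss) / np.sum(weights)
```

Under equal inclusion probabilities both expressions reduce to the ordinary sample MSE; under unequal probabilities the unweighted mean is generally biased for the population-level generalization error, which the weighting corrects.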

The second and third parts discuss the thesis that current developments in the foundations of statistics, including decision theory and uncertainty quantification, can contribute substantially to practical machine learning. Indeed, we utilize a recently proposed concept of generalized stochastic dominance to compare classifiers simultaneously over multiple data sets with respect to several evaluation criteria, and we show how to confirm observed quality differences statistically by extending Demšar's test. Finally, we speculate on the role of set-valued methods for robust predictions in machine learning under complex uncertainty.
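Generalized stochastic dominance extends the classical dominance order to several evaluation criteria at once. As a deliberately simplified illustration for a single criterion (not the generalized concept of the paper, and with a function name of our choosing), plain first-order stochastic dominance between two classifiers' empirical performance distributions across data sets can be checked as follows:

```python
import numpy as np

def dominates(scores_a, scores_b):
    """First-order stochastic dominance check on empirical CDFs.
    A dominates B if A's empirical CDF lies (weakly) below B's at
    every threshold, i.e. A's scores are at least as often above
    any given level (higher scores = better)."""
    grid = np.union1d(scores_a, scores_b)
    cdf_a = np.searchsorted(np.sort(scores_a), grid, side="right") / len(scores_a)
    cdf_b = np.searchsorted(np.sort(scores_b), grid, side="right") / len(scores_b)
    return bool(np.all(cdf_a <= cdf_b))
```

Here `scores_a` and `scores_b` would hold, e.g., accuracies of two classifiers over the same collection of data sets. The generalized notion of the JMLR paper replaces this single total order by a preorder built from several, possibly conflicting, quality criteria.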

Major References

Nalenz, Rodemann, Augustin (2024, in press). Learning de-biased regression trees and forests from complex samples. Machine Learning.

Jansen, Nalenz, Schollmeyer, Augustin (2023). Statistical comparisons of classifiers by generalized stochastic dominance. Journal of Machine Learning Research 24:1-37.

*: Speaker

1: LMU Munich - Germany