Use of statistical learning algorithms to integrate administrative and survey data in Short-Term Business Statistics
Sandra Barragan*1, David Salgado1, Ester Puerto1, Sergio Pardina1
Abstract
The use of administrative data is a must not only for the modernization of official statistics production but also for remaining relevant in the new international data and AI ecosystem. These new data sources offer many advantages, such as helping to reduce response burden when they are integrated with survey data. However, as is widely known, incorporating new data sources is not without drawbacks. By and large, administrative data cannot be directly substituted, used, or aggregated, since errors arise along both the representation and measurement lines, even where these were formerly under control when using survey data alone.
Representation errors (especially regarding coverage) arise because of unit misclassification errors and other factors. Validity, measurement, and process errors easily occur because of the administrative (non-statistical) purposes of these data sources. Overall, the fact that the data generation mechanism lies outside the control of the statistical process reintroduces both non-sampling errors (validity error, for example) and inferential challenges (non-ignorability, for instance).
We propose an end-to-end statistical production process integrating administrative data with survey data in a probability sample. Synthetic values are computed with a statistical learning model that uses the administrative data as regressors, so that validity and measurement errors can be identified a priori and kept under control. The statistical learning algorithm learns from past and present survey and administrative data, producing high-quality values for non-influential units, which paves the way to reducing response burden. Influential units are still integrated using survey data.
We share a proof of concept on the monthly Services Sector Activity Indicators using monthly VAT data. We discuss challenges regarding the statistical model, the feature engineering, and the training data.
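The abstract does not specify the statistical learning algorithm or the rule used to flag influential units. The sketch below is only an illustration of the general workflow it describes, under assumptions of our own: synthetic data, hypothetical column names (vat_turnover, survey_turnover, etc.), a gradient boosting regressor as the learning model, and a simple top-decile rule for influential units. None of these choices should be read as the authors' actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Illustrative training table: one row per unit and month, with the survey
# target (turnover) and administrative regressors (hypothetical column names).
rng = np.random.default_rng(0)
n = 500
train = pd.DataFrame({
    "vat_turnover": rng.gamma(2.0, 50.0, n),         # current-month VAT declaration
    "vat_turnover_lag12": rng.gamma(2.0, 50.0, n),   # same month, previous year
    "survey_turnover_lag1": rng.gamma(2.0, 50.0, n), # last observed survey value
})
train["survey_turnover"] = (
    0.9 * train["vat_turnover"]
    + 0.1 * train["survey_turnover_lag1"]
    + rng.normal(0.0, 5.0, n)
)

features = ["vat_turnover", "vat_turnover_lag12", "survey_turnover_lag1"]

# Learn from past and present survey and administrative (VAT) data.
model = HistGradientBoostingRegressor(max_iter=200, random_state=0)
model.fit(train[features], train["survey_turnover"])

# Current reference month: synthetic values for non-influential units,
# collected survey values retained for influential units.
current = train.sample(50, random_state=1).copy()
current["influential"] = (
    current["vat_turnover"] > current["vat_turnover"].quantile(0.9)
)
current["synthetic_turnover"] = model.predict(current[features])
current["final_turnover"] = np.where(
    current["influential"],
    current["survey_turnover"],    # influential units: survey data
    current["synthetic_turnover"], # non-influential units: model-based value
)
print(current[["influential", "survey_turnover", "final_turnover"]].head())
```

In this reading, response burden is reduced because non-influential units could eventually be exempted from monthly reporting, their values being produced by the model from the VAT regressors, while the survey effort concentrates on the influential units.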