Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 4.2 Processes

An open source data science platform to foster innovative and production-ready machine learning systems

Romain Avouac* 1, Thomas Faria* 1

Abstract

Machine learning is becoming an increasingly important tool for the production of official statistics. Not incidentally, an increasing number of public statisticians trained as data scientists have joined national statistical institutes (NSIs) in recent years. However, these new profiles often find themselves isolated in national statistical systems, and their ability to derive value from machine learning methods is limited by several challenges.

The first challenge is the lack of suitable IT infrastructure. Training a machine learning model generally involves large amounts of data and high computational capacity. Similarly, emerging techniques often require specific hardware, such as GPUs, to perform computations in a massively parallel way. Such resources are rarely found in personal computers or traditional IT infrastructures.

Another challenge is the transition from machine learning experiments to production-ready solutions. Production environments often differ from development environments, so that the additional development costs needed to go from a proof of concept to a system working in production can limit the feasibility of this transition. Besides, in a production setting, a machine learning system needs both to be scaled to changing demand and to be properly monitored. Finally, models generally need to be periodically or continuously updated, which requires proper management of their lifecycle in order to ensure reproducibility. These various challenges highlight the need for both technical infrastructure and automation tools that can help statisticians and IT teams implement the best practices advocated by the MLOps approach.

Against that background, we developed the SSP Cloud, an open-innovation data science platform built upon state-of-the-art IT components to provide statisticians with scalable and reproducible environments. The platform rests on three structuring choices: cloud computing, object storage and containerization. Together, these make it possible to provide extensive computing resources, with the benefits of a centralized infrastructure, while managing concurrent access to these resources and keeping services isolated from one another. We provide an extensive catalog of services to cover the entire lifecycle of a machine learning project: interactive services (R, Python, Julia) for the development phase and automation tools (MLflow to industrialize model training, Argo Workflows to orchestrate parallel jobs) to develop production-ready systems.

This presentation aims to provide insights into how the MLOps approach, coupled with a robust data science platform, enabled the successful implementation of an ML model in a real-world scenario, emphasizing the importance of proper lifecycle management and the tools available to achieve it. To make our talk as concrete as possible, we will illustrate this approach with the NACE classifier model, the first ML model to be implemented in a production environment at Insee. The various stages, from model training to deployment, monitoring and retraining, will be detailed. The MLOps approach played a crucial role in streamlining these processes, and our platform emerged as a key facilitator.

The building principles of this platform were further refined into an open-source project: Onyxia. As a result, public organizations can create their own internal instance of this modern data science platform and tailor it to the needs of their end users.

*: Speaker

1: Insee - France