Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.1 Using Large Language Models

Enhancing Accessibility to Statistical Data through an Open-Source Chatbot Integrated with Language Models

Eva Charlotte Berner*1, Benedikt Goodwin1, Susie Jentoft1, Eirik Fredborg1

Abstract

This paper details the development of a novel, open-source chatbot that makes data in Statistics Norway's databank easier to access. To meet the need for user-centric data retrieval, we draw on selected open-source and proprietary large language models (LLMs). The system is built in Python and combines asynchronous HTTP request handling with the LangChain library, which connects the chatbot to the LLM.

The chatbot incorporates a set of specialized tools and functions for interacting with Statistics Norway's (SSB) API, covering the main steps of a data request. The key functions are: searching for relevant data tables based on a user query, retrieving the metadata of a specific table to understand its structure, and executing structured queries against the identified tables to fetch the required data.
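The three tools described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the table ids, titles, variable codes, and function names are hypothetical, the data lives in memory rather than being fetched over HTTP, and the query body assumes a PxWeb-style JSON format (a list of variable selections plus a response format), which the abstract itself does not specify.

```python
from __future__ import annotations

from typing import Any

# Hypothetical in-memory stand-ins for API responses; a real tool would
# fetch these over HTTP. Table ids, titles, and codes are illustrative only.
TABLES = [
    {"id": "T001", "title": "Population by region and year"},
    {"id": "T002", "title": "Consumer price index by month"},
]
METADATA = {
    "T001": {
        "title": "Population by region and year",
        "variables": [
            {"code": "Region", "values": ["R1", "R2"]},
            {"code": "Year", "values": ["2022", "2023"]},
        ],
    }
}


def search_tables(query: str, tables: list[dict[str, str]]) -> list[dict[str, str]]:
    """Tool 1: rank tables by how many query words appear in the title."""
    words = query.lower().split()
    scored = []
    for table in tables:
        title_words = table["title"].lower().split()
        score = sum(word in title_words for word in words)
        if score > 0:
            scored.append((score, table))
    # Sort on the score only, so equally ranked tables keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [table for _, table in scored]


def get_metadata(table_id: str) -> dict[str, Any]:
    """Tool 2: return the variables and allowed values of one table."""
    return METADATA[table_id]


def build_query(metadata: dict[str, Any], selections: dict[str, list[str]]) -> dict[str, Any]:
    """Tool 3 (request side): validate selections, then build a query body.

    The body shape is an assumed PxWeb-style format, not taken from the paper.
    """
    allowed = {var["code"]: set(var["values"]) for var in metadata["variables"]}
    query = []
    for code, values in selections.items():
        if code not in allowed:
            raise ValueError(f"unknown variable: {code}")
        invalid = [v for v in values if v not in allowed[code]]
        if invalid:
            raise ValueError(f"invalid values for {code}: {invalid}")
        query.append({"code": code, "selection": {"filter": "item", "values": values}})
    return {"query": query, "response": {"format": "json-stat2"}}


# Chain the tools the way the chatbot would: search, inspect, then query.
hits = search_tables("population region", TABLES)
payload = build_query(get_metadata(hits[0]["id"]), {"Year": ["2023"]})
```

Validating the selections against the metadata before building the request is one way to catch variable codes or values that an LLM has made up, since they are rejected locally instead of being sent to the API.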

A central part of the chatbot's design is its user-oriented interface, built with the Chainlit library. It gives users an intuitive, interactive platform for posing queries and receiving prompt, accurate responses. The system's asynchronous design improves performance and scalability by handling concurrent user requests without degrading response times.
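The effect of the asynchronous design can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in rather than the chatbot's code: it uses neither Chainlit nor LangChain, the `answer` and `serve` names are hypothetical, and `asyncio.sleep` takes the place of awaiting an HTTP or LLM call.

```python
from __future__ import annotations

import asyncio


async def answer(user: str, question: str, log: list[str]) -> str:
    """Handle one chat request; the sleep stands in for a slow network call."""
    log.append(f"start {user}")
    await asyncio.sleep(0.01)  # control returns to the event loop here
    log.append(f"end {user}")
    return f"{user}: answer to {question!r}"


async def serve(requests: list[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Run all requests concurrently on a single event loop."""
    log: list[str] = []
    # gather schedules every coroutine and returns results in input order.
    replies = await asyncio.gather(*(answer(u, q, log) for u, q in requests))
    return list(replies), log


replies, log = asyncio.run(
    serve([("alice", "population 2023"), ("bob", "CPI in May")])
)
```

Both requests are logged as started before either one finishes: while one request waits on I/O, the event loop moves on to the next instead of blocking.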

The project is at a preliminary stage and currently consists of a minimum viable product with significant potential for enhancement. Future research will analyze relevant LLMs in depth, both open-source and proprietary, examining their capabilities, biases, and suitability, with the aim of refining the chatbot's functionality and addressing its limitations.

*: Speaker

1: Statistics Norway