Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 1.1 Using Large Language Models

Extracting Data Citations with Large Language Models

Sebastian Seltmann1, Emily Kormanyos1, Hendrik Doll* 1


In national statistical offices and central banks, research data centers (RDCs) have proliferated in recent years, tasked with providing secure access to granular administrative data for research. Empirical researchers and RDCs face challenges in efficiently tracing the data used in scholarly papers. This process currently relies on human readers and is time-consuming and error-prone. To address this, we explore the potential of Large Language Models (LLMs), specifically GPT-3.5, to automate the identification and categorization of research data sources. We analyze the accuracy of GPT-3.5 in detecting and summarizing data sources in economics and finance papers.

Using web scraping, we collect a comprehensive sample of research papers and create human-labeled validation datasets. We evaluate detection and prediction accuracy and address the issue of false answers provided by the model. We find that LLMs can improve considerably on the status quo, with encouraging results in terms of precision and recall. Additionally, we assess the pre-processing requirements of GPT-3.5 for cost-effective implementation. Finally, our paper provides a guide for implementing the proposed solution at data-providing institutions and RDCs, aiming to enhance data analysis and research data provision services.
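
As an illustration of the evaluation described above, the following is a minimal sketch of how precision and recall could be computed by comparing model-extracted data sources against a human-labeled validation set. It is not the authors' implementation: the function name `precision_recall`, the per-paper set representation, and the paper and dataset identifiers are all hypothetical and chosen only for this example.

```python
from typing import Dict, Set, Tuple


def precision_recall(
    predicted: Dict[str, Set[str]],
    labeled: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (paper, data source) pairs.

    Both arguments map a paper identifier to the set of data sources
    attributed to that paper (hypothetical representation).
    """
    tp = fp = fn = 0
    for paper_id in labeled.keys() | predicted.keys():
        pred = predicted.get(paper_id, set())
        gold = labeled.get(paper_id, set())
        tp += len(pred & gold)   # sources found by both model and human
        fp += len(pred - gold)   # sources the model reported but the human did not
        fn += len(gold - pred)   # sources the human labeled but the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: the human labels say paper_a uses two datasets and paper_b none.
labeled = {"paper_a": {"dataset_x", "dataset_y"}, "paper_b": set()}
# Made-up model output: one correct hit, one miss, and one false answer for paper_b.
predicted = {"paper_a": {"dataset_x"}, "paper_b": {"dataset_z"}}

precision, recall = precision_recall(predicted, labeled)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this sketch, a data source reported for a paper that the human labelers did not list counts as a false positive, which is one simple way to quantify false answers from the model. Exact string matching of dataset names is an assumption; in practice, names would likely need normalization before comparison.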

*: Speaker

1: Deutsche Bundesbank - Germany