A tentative framework to document the process of synthetic survey data generation with large language models
Roth, Matthias
GESIS – Leibniz Institute for the Social Sciences in Mannheim
Abstract
Recent research suggests that large language models (LLMs) can be a source of high-quality synthetic survey data. Use cases include simulating representative survey data, imputing missing data, and evaluating LLM response behavior. However, comparing results across studies is difficult because a multitude of approaches to creating synthetic survey data exists. This is problematic because methodological choices have been shown to severely affect the substantive conclusions drawn from synthetic survey data. A framework for documenting the process of generating synthetic survey data is therefore needed to increase comparability between studies.
I present a tentative framework to document the process of synthetic survey data generation. I structure the creation of synthetic survey data into four steps: (1) prompt design, (2) model specification, (3) response generation, and (4) analytical choices. Prompt design relates to how survey questions are presented to the LLM and whether previous answers are considered. Model specification describes which LLM is used. For example, it is important to specify whether a text-completion or an instruction-tuned model was used, as the two are designed to respond in different ways. Response generation refers to the way responses are extracted from the LLM. Prominent choices include text matching and the direct use of response probabilities. Finally, depending on the previous steps, different analytical choices can be made, such as aggregating responses per synthetic respondent or by (sub-)population.
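As a rough illustration of how the four steps could be recorded alongside a synthetic data set, the following Python sketch defines a small documentation record. All field names, the class name, and the example values are hypothetical and chosen only for illustration; they are not part of the proposed framework itself.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SyntheticSurveyRecord:
    """Hypothetical documentation record; all field names are illustrative."""

    # (1) Prompt design
    question_format: str             # how the survey question is shown to the LLM
    includes_previous_answers: bool  # whether earlier answers are part of the prompt
    # (2) Model specification
    model_name: str                  # which LLM was used
    model_type: str                  # "text-completion" or "instruction-tuned"
    # (3) Response generation
    extraction_method: str           # "text-matching" or "response-probabilities"
    # (4) Analytical choices
    aggregation_level: str           # "respondent" or "subpopulation"


def to_json(record: SyntheticSurveyRecord) -> str:
    """Serialise the record so it can be stored next to the generated data."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values, not taken from any actual study.
example = SyntheticSurveyRecord(
    question_format="original wording with numbered response options",
    includes_previous_answers=False,
    model_name="example-llm",
    model_type="instruction-tuned",
    extraction_method="response-probabilities",
    aggregation_level="subpopulation",
)

print(to_json(example))
```

Storing such a record with every generated data set would make it possible to see which of the four steps differ between two studies before their results are compared.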
For each step, I review use cases from the literature and highlight how methodological choices can impact the distribution of synthetic survey data. Additionally, I present an empirical example using a model based on a publicly available LLM, Llama2. Publicly available LLMs currently allow for the highest degree of transparency and flexibility in the synthetic data generation process. The LLM will be used to generate responses that aim to match probabilistic survey data of the German population.
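To make concrete how a response generation choice can change the resulting distribution, the following Python sketch compares two aggregation rules on the same input. The probabilities are invented for illustration and do not come from any model or study. Text matching is approximated here by taking each synthetic respondent's most probable option, which is itself a simplifying assumption. The function names are hypothetical.

```python
def share_by_top_choice(respondent_probs):
    """Approximate text matching: each respondent counts once for their most likely option."""
    options = list(respondent_probs[0])
    counts = {option: 0 for option in options}
    for probs in respondent_probs:
        counts[max(probs, key=probs.get)] += 1
    n = len(respondent_probs)
    return {option: counts[option] / n for option in options}


def share_by_mean_probability(respondent_probs):
    """Direct use of response probabilities: average them across respondents."""
    options = list(respondent_probs[0])
    n = len(respondent_probs)
    return {option: sum(p[option] for p in respondent_probs) / n for option in options}


# Invented option probabilities for two synthetic respondents.
data = [
    {"agree": 0.60, "disagree": 0.40},
    {"agree": 0.55, "disagree": 0.45},
]

top_choice = share_by_top_choice(data)
mean_probability = share_by_mean_probability(data)
print(top_choice)
print(mean_probability)
```

With these invented inputs, the top-choice rule assigns every synthetic respondent to "agree", whereas averaging the probabilities keeps a sizeable "disagree" share, so the same model output yields two different aggregate distributions.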
I conclude with a brief discussion of how the proposed framework improves the comparability of different approaches to synthetic survey data generation and how it complements other best practices of reproducible research.