Deep text anonymisation for German legal documents
Stephanie Evert* 1
Abstract
German courts are legally required to publish all their verdicts, but these must be anonymised in order to protect the fundamental right to informational self-determination of those involved. However, because manual anonymisation is costly and time-consuming, only about 3% of the approximately 1.5 million verdicts issued annually are published.
Moreover, very few of these published verdicts come from first-instance courts that handle the basic facts of a case.
This lack of published verdicts not only denies citizens and legal professionals access to a crucial information source, but also deprives the legal tech community of valuable training data and text mining material. We aim to develop algorithms and models for fully automatic anonymisation of court verdicts, which is the only viable approach that scales to 1.5 million verdicts per year. Full automation is a challenging task: not only must personally identifying information (PII, such as names, addresses, license plates, etc.) be detected with near-perfect recall; a comprehensive system must also address pseudo-identifiers such as profession details, health data, or unique features of people and objects.
One crucial bottleneck is the lack of error-free gold standards, which are necessary for the evaluation of high-risk AI models (Adrian et al. 2024). To this end, we created two gold standards of approximately 1 million words each: AG, comprising verdicts from district courts (limited to tenancy and traffic law), and OLG, with verdicts from higher regional courts (across 11 legal domains). To ensure the absence of errors (cf. Heinrich et al. 2021), each verdict in the gold standards was annotated by four independent annotators and adjudicated in a subsequent step.
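The multiple-annotation workflow can be illustrated with a toy adjudication step: token positions on which the four annotators disagree are flagged for manual resolution. The passage, the tag set (O, PII, PSEUDO), and all labels below are invented for this sketch and do not reflect the actual annotation scheme.

```python
# Toy token-level labels from four independent annotators for one short
# passage; tag set and labels are illustrative only.
tokens = ["Der", "Mieter", "Max", "Mustermann", "zahlte", "nicht"]
annotations = [
    ["O", "O",      "PII", "PII", "O", "O"],  # annotator 1
    ["O", "O",      "PII", "PII", "O", "O"],  # annotator 2
    ["O", "PSEUDO", "PII", "PII", "O", "O"],  # annotator 3
    ["O", "O",      "PII", "PII", "O", "O"],  # annotator 4
]

# Flag every position where the four labels are not unanimous;
# these positions go to a human adjudicator.
to_adjudicate = [
    (i, tokens[i], sorted({a[i] for a in annotations}))
    for i in range(len(tokens))
    if len({a[i] for a in annotations}) > 1
]
print(to_adjudicate)  # [(1, 'Mieter', ['O', 'PSEUDO'])]
```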
Fine-tuning pre-trained large language models has proven highly successful for span detection tasks such as named entity recognition and text anonymisation. Using the German foundation model GottBERT (Scheible et al. 2020) fine-tuned with three linear layers for 5 epochs, we achieve 99% recall for PII in an in-domain evaluation on the AG gold standard, with precision and recall both close to 97% across all critical text spans (including pseudo-identifiers). Further improvements can be achieved by systematic parameter optimisation; we also present results with the multilingual language model XLM-RoBERTa (Conneau et al. 2019).
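The classification head can be sketched as three stacked linear layers mapping each encoder token vector to a label score. This is a minimal NumPy illustration with random weights, not the project's actual code: the hidden dimension matches GottBERT-style encoders, but the intermediate width, ReLU activations, and five-label tag set are assumptions (the abstract only states "three linear layers").

```python
import numpy as np

rng = np.random.default_rng(42)

def linear(x, w, b):
    """One affine (linear) layer: x @ w + b."""
    return x @ w + b

# Hypothetical dimensions: 768-dim token vectors from the encoder, an
# illustrative 5-label BIO-style tag set for PII/pseudo-identifier spans.
hidden_dim, mid_dim, n_labels = 768, 256, 5
seq_len = 10                                  # tokens in one toy passage
h = rng.normal(size=(seq_len, hidden_dim))    # stand-in encoder output

# Three linear layers on top of the encoder; ReLU between them is an
# assumption, as the abstract does not specify the activation.
w1, b1 = rng.normal(size=(hidden_dim, mid_dim)) * 0.02, np.zeros(mid_dim)
w2, b2 = rng.normal(size=(mid_dim, mid_dim)) * 0.02, np.zeros(mid_dim)
w3, b3 = rng.normal(size=(mid_dim, n_labels)) * 0.02, np.zeros(n_labels)

x = np.maximum(linear(h, w1, b1), 0.0)        # ReLU
x = np.maximum(linear(x, w2, b2), 0.0)        # ReLU
logits = linear(x, w3, b3)                    # one score per token and label

predictions = logits.argmax(axis=-1)          # predicted label per token
print(logits.shape, predictions.shape)        # (10, 5) (10,)
```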
While recall drops substantially in a cross-domain setting (i.e. when the model trained on the AG gold standard is evaluated on verdicts from OLG), this effect depends on the legal domain, with PII recall still around 99% in 7 out of 11 domains. Since the models perform considerably worse on pseudo-identifiers (which are much more domain-specific), we apply several text augmentation techniques such as Easy Data Augmentation (Wei & Zou 2019) to increase the amount of training data and improve performance on those instances. Finally, we determine learning curves to show that domain adaptation with moderate amounts of training data is sufficient to achieve fairly good results.
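Two of the EDA operations (random swap and random deletion) can be sketched in a few lines of plain Python. The German fragment is invented, and the sketch omits a detail the real pipeline must handle: keeping annotated spans aligned with their labels after augmentation.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def random_swap(tokens, n=1):
    """EDA operation: swap the positions of two random tokens, n times."""
    tokens = tokens.copy()
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """EDA operation: drop each token independently with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]  # never return empty

# Invented tenancy-law fragment; in the real pipeline, span labels
# would be carried along with the surviving tokens.
sentence = "Der Vermieter kündigte das Mietverhältnis fristlos".split()
augmented = [random_swap(sentence), random_deletion(sentence)]
for variant in augmented:
    print(" ".join(variant))
```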
References
Adrian, A., Evert, S., Heinrich, P. and Keuchen, M. (2024). Auslegung des KI-VO-E zur Evaluation von Verfahren der Künstlichen Intelligenz am Beispiel der automatischen Anonymisierung von Gerichtsentscheidungen. In Juristische Sprachmodelle – Tagungsband des 27. Internationalen Rechtsinformatik Symposions IRIS 2024. Editions Weblaw.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Heinrich, P., Dykes, N. and Evert, S. (2021). Annotator agreement in the anonymization of court decisions. Presentation at the Corpus Linguistics 2021 Conference, Limerick/online.
Scheible, R., Thomczyk, F., Tippmann, P., Jaravine, V. and Boeker, M. (2020). GottBERT: A pure German language model. arXiv preprint arXiv:2012.02110.
Wei, J. and Zou, K. (2019). EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.
*: Speaker
1: Friedrich-Alexander-Universität Erlangen-Nürnberg (AnGer)