CO-Fun: A German Dataset on Company Outsourcing in Fund Prospectuses for Named Entity Recognition and Relation Extraction

Neda Foroutan, Markus Schröder, Andreas Dengel

TL;DR

CO-Fun addresses the need for structured extraction of outsourcing-related information from German fund prospectuses to support cyber mapping in finance. It introduces 948 annotated sentences spanning four entity types and two relation types, and benchmarks NER and RE with CRF, BERT, and RoBERTa, demonstrating competitive performance and feasibility. The dataset is anonymized and publicly released under an MIT license, enabling broader use and future enhancements such as knowledge-graph integration for improved cyber risk analysis. Overall, CO-Fun provides a practical resource for German NLP in the financial domain and a foundation for downstream cyber-risk mapping tasks.

Abstract

Cyber mapping provides insight into the relationships among financial entities and service providers. Focusing on the outsourcing practices of companies as described in German fund prospectuses, we introduce a dataset designed for named entity recognition and relation extraction. Three experts labeled 948 sentences, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing-Company, Company-Location). State-of-the-art deep learning models were trained to recognize entities and extract relations, showing promising first results. An anonymized version of the dataset, along with the guidelines and the code used for model training, is publicly available at https://www.dfki.uni-kl.de/cybermapping/data/CO-Fun-1.0-anonymized.zip.
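To make the annotation schema concrete, the following is a minimal Python sketch of how one annotated sentence could be represented and checked against the four entity types and two relation types named in the abstract. The class names, the character-offset convention, the helper functions, and the example sentence (including the placeholder company "Beispiel GmbH") are all assumptions made for illustration; they are not taken from the dataset and do not describe the released file format.

```python
from dataclasses import dataclass

# Entity and relation types as listed in the abstract.
ENTITY_TYPES = {"Outsourcing", "Company", "Location", "Software"}
RELATION_TYPES = {("Outsourcing", "Company"), ("Company", "Location")}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset, inclusive (assumed convention)
    end: int    # character offset, exclusive (assumed convention)


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity


def find_entity(sentence: str, text: str, label: str) -> Entity:
    """Hypothetical helper: locate the first occurrence of `text` and label it."""
    if label not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {label}")
    start = sentence.index(text)
    return Entity(text, label, start, start + len(text))


def is_valid_relation(rel: Relation) -> bool:
    """A relation is valid only if its (head, tail) labels match the schema."""
    return (rel.head.label, rel.tail.label) in RELATION_TYPES


# Invented example sentence, NOT taken from the dataset.
sentence = "Die Fondsbuchhaltung wurde an die Beispiel GmbH in Frankfurt ausgelagert."
outsourcing = find_entity(sentence, "Fondsbuchhaltung", "Outsourcing")
company = find_entity(sentence, "Beispiel GmbH", "Company")
location = find_entity(sentence, "Frankfurt", "Location")

relations = [Relation(outsourcing, company), Relation(company, location)]
valid = [is_valid_relation(r) for r in relations]
```

Restricting relations to typed (head, tail) pairs means a malformed pair such as Location-Outsourcing is rejected before it reaches a relation extraction model.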

Paper Structure (12 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: A graphical user interface in German to annotate a sentence (top) with named entities (center) and relations (bottom). Entity types are 'Auslagerung [Outsourcing]', 'Unternehmen [Company]', 'Ort [Location]' and Software.

Theorems & Definitions (1)

  • Example 1