3CEL: A corpus of legal Spanish contract clauses
Nuria Aldama García, Patricia Marsà Morales, David Betancur Sánchez, Álvaro Barbero Jiménez, Marta Guerrero Nieto, Pablo Haya Coll, Patricia Martín Chozas, Elena Montiel Ponsoda
TL;DR
This paper presents 3CEL, a corpus of legal Spanish contract clauses for contract information extraction, produced under the INESData 2024 initiative. 3CEL comprises 373 tenders manually annotated with 19 categories (4,782 total tags), enabling structured extraction of contract-relevant information. By delivering a high-quality, domain-specific dataset in Spanish, the work addresses the scarcity of Spanish legal NLP resources and provides a basis for developing and evaluating information extraction (IE) systems in the legal domain. The dataset supports contract understanding and review in Spanish legal texts and contributes to broader NLP resource development for underrepresented languages.
Abstract
Legal corpora for Natural Language Processing (NLP) are valuable but scarce resources in languages like Spanish for two main reasons: data accessibility and the availability of legal expert knowledge. INESData 2024 is a European Union-funded project led by the Universidad Politécnica de Madrid (UPM) and developed by the Instituto de Ingeniería del Conocimiento (IIC) to create a series of state-of-the-art NLP resources for the legal/administrative domain in Spanish. This paper presents the Corpus of Legal Spanish Contract Clauses (3CEL), a contract information extraction corpus developed within the framework of INESData 2024. 3CEL contains 373 tenders manually annotated with 19 defined categories (4,782 total tags) that identify key information for contract understanding and review.
