Adapting PromptORE for Modern History: Information Extraction from Hispanic Monarchy Documents of the XVIth Century

Hèctor Loopez Hidalgo, Michel Boeglin, David Kahn, Josiane Mothe, Diego Ortiz, David Panzoli

TL;DR

This work tackles the challenge of extracting semantic relations from XVIth-century Spanish historical documents, where archaic language and dense, multi-entity sentences hinder standard relation extraction. It adapts PromptORE by introducing a biasing phase that domain-tunes encoder models on historical Spanish corpora, and by designing prompts that account for gender and anaphora in Spanish. The four-phase Biased PromptORE pipeline (composition, biasing, prompting, and relation extraction) leverages expert texts and a targeted trial to improve accuracy, achieving up to a 50 percent relative gain over baselines, with BETO-based biased models delivering the strongest performance. The results demonstrate the value of domain-specific pretraining and tailored prompting for historical documents, and point to future work in multiclass relation settings and more complex relation types.

Abstract

Identifying semantic relations among entities is a widely accepted approach to information extraction. PromptORE (Prompt-based Open Relation Extraction) was designed to improve relation extraction with Large Language Models on general-domain documents. However, it is less effective when applied to historical documents in languages other than English. In this study, we introduce an adaptation of PromptORE to extract relations from specialized documents, namely digital transcripts of trials from the Spanish Inquisition. Our approach involves fine-tuning transformer models with their pretraining objective on the data on which they will later perform inference. We refer to this process as "biasing". Our Biased PromptORE addresses the complex entity placements and grammatical gender that occur in Spanish texts, and we handle both through prompt engineering. We evaluate our method using encoder-like models, corroborating our findings with experts' assessments. Additionally, we evaluate performance on a binary classification benchmark. Our results show a substantial improvement in accuracy, up to 50%, for our Biased PromptORE models compared with baseline models using standard PromptORE.
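To make the "biasing" idea concrete, the following is a minimal sketch of how one training example for a masked-language-modelling objective can be built. It is not the authors' implementation: the function name `mask_tokens`, the 15% masking rate, the whitespace tokenization, and the example sentence are all assumptions chosen for illustration, and the sketch omits the random-replacement and keep-as-is variants that BERT-style pretraining also uses.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one masked-language-modelling example (illustrative only).

    Returns (masked_tokens, labels). labels[i] is the original token
    where a mask was applied and None elsewhere, so a loss would be
    computed only on the masked positions.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels

# Hypothetical sentence, not taken from the paper's corpus.
tokens = "el reo fue preso por el Santo Oficio".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

Under this sketch, continuing pretraining on in-domain text would mean feeding many such masked sentences from the historical corpus to the encoder before it is used for prompting.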

Paper Structure (17 sections, 9 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of the original PromptORE technique, taken from Genest et al. (2022). Starting from a dataset $\mathcal{D}$ in which each sentence contains exactly two entities, the authors encode the relations with a Masked Language Modelling objective: the model must predict the [MASK] token, which minimizes the cross-entropy loss and teaches it the semantics of the language in the process.
  • Figure 2: An overview of our bias-prompt-extract technique for complex historical documents. The biasing phase is inserted between the composition phase (selection of documents from the period) and the prompting phase (testing PromptORE baselines and biased BERT and RoBERTa models).
  • Figure 3: Number of entities per sentence. Splitting the text into sentences at punctuation marks reveals how hard these texts are to parse: a significant number of sentences contain too many entities for standard tools, or human readers, to draw any reliable information from them.
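The kind of count summarized in Figure 3 can be sketched as follows. This is only an illustration under stated assumptions: the set of punctuation marks used for splitting, the helper names `split_sentences` and `entities_per_sentence`, the example text, and the hand-written entity list are all hypothetical. A real pipeline would obtain entities from annotation or a named-entity recognizer, not from plain substring matching.

```python
import re

def split_sentences(text):
    """Split text at sentence-ending punctuation (assumed here: . ; ? !)."""
    parts = re.split(r"[.;?!]+", text)
    return [p.strip() for p in parts if p.strip()]

def entities_per_sentence(text, entities):
    """Count how many of the given entity strings occur in each sentence.

    Matching is naive substring search, so "Juan" would also match
    inside "Juana"; this is acceptable only for a toy example.
    """
    return [
        sum(1 for entity in entities if entity in sentence)
        for sentence in split_sentences(text)
    ]

# Hypothetical text and entity list, not taken from the paper's corpus.
text = "Juan acusó a Pedro. María testificó; Pedro negó todo."
entities = ["Juan", "Pedro", "María"]
counts = entities_per_sentence(text, entities)  # [2, 1, 1]
```

Plotting a histogram of such counts over a corpus would show how many sentences exceed the two-entity setting that the original PromptORE assumes.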