Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents

Emanuela Boros, Maud Ehrmann

TL;DR

Experiments demonstrate the existence of OCR-sensitive regions within the Transformer architecture and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.

Abstract

This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.
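The identify-then-neutralise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the selection criterion (mean absolute activation difference between clean and noisy inputs), and the reading of the neutralising factor as a scaling by `1 - factor` are all assumptions made for illustration, and the activations are synthetic.

```python
import numpy as np


def find_sensitive_neurons(clean_acts, noisy_acts, top_k):
    """Return indices of the top_k neurons whose activations change most.

    Both inputs have shape (n_tokens, n_neurons) and hold activations for the
    same tokens, once from clean text and once from OCR-noised text.
    Assumption: sensitivity = mean absolute activation difference.
    """
    diff = np.abs(clean_acts - noisy_acts).mean(axis=0)
    return np.argsort(-diff)[:top_k]


def neutralise(acts, neuron_ids, factor):
    """Damp the selected neurons; all other neurons are left untouched.

    Assumption: a neutralising factor f scales an activation by (1 - f),
    so f = 0.9 keeps 10% of the original value.
    """
    out = np.array(acts, dtype=float, copy=True)
    out[:, neuron_ids] *= 1.0 - factor
    return out


# Synthetic example: neurons 2 and 5 react to the simulated "OCR noise".
rng = np.random.default_rng(0)
clean = rng.normal(size=(16, 8))
noisy = clean.copy()
noisy[:, [2, 5]] += 3.0

ids = find_sensitive_neurons(clean, noisy, top_k=2)
damped = neutralise(noisy, ids, factor=0.9)
```

In a real model the activations would come from a chosen Transformer layer, and the damping would be applied during the forward pass rather than on a stored array.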

Paper Structure (11 sections, 7 figures)

Figures (7)

  • Figure 1: (1) CKA values (y-axis) between correct input tokens and input tokens altered by low, average, or high OCR noise, and (2) the number of OCR-sensitive neurons (secondary y-axis), for each of the 32 layers (x-axis) of Llama2 (top) and Mistral (bottom).
  • Figure 2: F1 score improvements from neuron activation modifications, by neuron bin (y-axis) and layer (x-axis), for Llama2 and Mistral on hipe2020 with a neutralising factor of 0.9. Warmer colours indicate higher F1 score improvements.
  • Figure 3: F1-score improvements of Llama2 (top) and Mistral (bottom) on ajmc.
  • Figure 4: F1 score improvements across layers and neuron bins in Llama2 on the ajmc dataset, with neutralising factors of 0.1 and 0.5.
  • Figure 5: F1 score improvements across layers and neuron bins in Llama2 on the hipe2020 dataset, with neutralising factors of 0.1 and 0.5.
  • ...and 2 more figures
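
Figure 1 reports CKA (centered kernel alignment) values between representations of correct and noise-altered inputs. This listing does not say which CKA variant is used; the sketch below assumes linear CKA, and the function name and synthetic data are illustrative only.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x has shape (n_examples, d1) and y has shape (n_examples, d2); rows must
    correspond to the same examples (e.g. the same tokens, clean vs. noisy).
    """
    # Centre each feature (column) over the examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Synthetic stand-ins for one layer's activations on clean and noisy text.
rng = np.random.default_rng(0)
clean_repr = rng.normal(size=(64, 10))
noisy_repr = clean_repr + 0.5 * rng.normal(size=(64, 10))

same = linear_cka(clean_repr, clean_repr)
perturbed = linear_cka(clean_repr, noisy_repr)
```

Identical representations give a value of 1, and the value falls as the noisy representation drifts away from the clean one, which is the kind of per-layer comparison plotted in Figure 1.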