Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

Tita Enstad, Trond Trosterud, Marie Iversdatter Røsok, Yngvil Beyer, Marie Roald

TL;DR

This study addresses the insufficient OCR accuracy for Sámi texts in the National Library of Norway's (NLN) collection by evaluating Transkribus, Tesseract, and TrOCR on a Sámi-rich corpus and by exploring data strategies suited to a low-resource language setting. Fine-tuning on ground-truth Sámi data, supplemented with machine-transcribed and synthetic text, substantially improves transcription quality on NLN data: Transkribus and TrOCR typically outperform Tesseract in-domain, while Tesseract performs best on an out-of-domain dataset. A two-stage synthetic pretraining regime consistently boosts performance, and combining synthetic data with limited manual annotations enables robust Sámi OCR with modest annotation effort. The findings support re-OCR campaigns for NLN materials and point to future work on non-standard orthographies, Skolt Sámi, and broader model exploration to further improve OCR pipelines for low-resource languages.

Abstract

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
