Callico: a Versatile Open-Source Document Image Annotation Platform

Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie

TL;DR

Callico addresses the data-centric AI challenge in document recognition by providing a versatile, open-source platform for collaborative, dual-display annotation of digitised documents. It combines a flexible design, ML-ready workflows, and on-premise Docker deployment with IIIF-based image access to support OCR, HTR, layout analysis, and information extraction. The paper details Callico’s design principles, technical stack, and real-world use cases (Belfort transcription, ICRC POW information extraction, and Socface census), demonstrating improvements in data quality and annotation efficiency. It also outlines future work on quality assessment, validation strategies, and community-driven expansion to broaden Callico’s impact in the document-annotation landscape.

Abstract

This paper presents Callico, a web-based open-source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key-value annotation, and hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features backed by a commitment to open-source development, high-quality code standards and easy deployment via Docker. Illustrative use cases - including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists - demonstrate Callico's applicability and utility.

Paper Structure (27 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Annotation mode for Document Structuring: an example of annotating different zones on a map using Document Structure mode, which allows the user to locate the different zones on the document and assign a type to each.
  • Figure 2: Annotation mode for Text Transcription: an example of line transcription with image and text side by side.
  • Figure 3: Annotation mode for Named Entities: an example of annotating entities on a text by defining their range and type.
  • Figure 4: Annotation mode for Key-Value Information: an example of annotating personal information from a table, with line highlighting.
  • Figure 5: Annotation mode for Element Grouping: an example of annotating a newspaper by grouping the different elements of each article.
  • ...and 1 more figure