
The Role of Generative Systems in Historical Photography Management: A Case Study on Catalan Archives

Èric Sánchez, Adrià Molina, Oriol Ramos Terrades

TL;DR

The study examines how generative systems influence automated captioning for historical Catalan archives, addressing language bias and historical-domain shift. It deploys the compact CATR captioning framework and evaluates the contributions of image generation and text generation using synthetic data and multilingual pretraining. Key findings show that natural images with translated captions outperform synthetic-data-only strategies, while language proximity and data scale significantly shape performance; synthetic images offer limited gains and can introduce noise. The work provides practical guidance for heritage institutions on transfer-learning configurations and highlights the need for domain-adaptation methods to responsibly apply generative tools in historical, multilingual contexts.

Abstract

The use of image analysis in automated photography management is a growing trend in heritage institutions. Such tools reduce the human cost of manual, expensive annotation of new data sources, while giving the public fast access to collections through online indexes and search engines. However, available tagging and description tools are usually designed around modern photographs in English, neglecting historical corpora in minoritized languages, each of which has its own particularities. The primary objective of this research is to quantify the contribution of generative systems to the description of historical sources. We do so by framing the captioning of historical photographs from Catalan archives as a case study. Our findings give practitioners tools and directions for transfer learning in captioning models, based on visual adaptation and linguistic proximity.

Paper Structure

This paper contains 14 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Example where the visual attributes associated with the token "car" exhibit higher visual variance in historical (left) than in modern (right) photographs.
  • Figure 2: CATR model architecture
  • Figure 3: Samples from the XAC collection. Named entities are prominent, and the collection shows high temporal and content diversity.
  • Figure 4: Gemma-2 [gemma_2024] token counts for English versus translated captions (left) and distribution of the number of tokens per language (right).
  • Figure 5: Poor representation of interactions.
  • ...and 5 more figures
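The comparison described in Figure 4 amounts to tokenizing every caption and aggregating the counts per language. The sketch below illustrates that bookkeeping under stated assumptions: `naive_tokenize` is a hypothetical stand-in (a regex word/punctuation splitter), not the Gemma-2 tokenizer used in the paper, so its absolute counts will not match the figure. The caption strings are illustrative only and are not taken from the XAC collection.

```python
import re
from statistics import mean
from typing import Callable, Dict, List


def naive_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer (NOT Gemma-2): splits text into runs of
    # word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def token_stats(
    captions: Dict[str, List[str]],
    tokenize: Callable[[str], List[str]] = naive_tokenize,
) -> Dict[str, dict]:
    """Per-language token counts: the raw distribution plus its mean."""
    stats = {}
    for lang, texts in captions.items():
        counts = [len(tokenize(t)) for t in texts]
        stats[lang] = {"counts": counts, "mean": mean(counts)}
    return stats


def relative_length(stats: Dict[str, dict], lang: str, reference: str = "en") -> float:
    # A ratio above 1 means captions in `lang` need more tokens, on average,
    # than captions in the reference language.
    return stats[lang]["mean"] / stats[reference]["mean"]


if __name__ == "__main__":
    # Illustrative captions only; any tokenizer callable can be passed instead.
    captions = {
        "en": ["A car parked in front of a church.", "Two men on a street."],
        "ca": ["Un cotxe aparcat davant d'una església.", "Dos homes en un carrer."],
    }
    stats = token_stats(captions)
    for lang, s in stats.items():
        print(lang, s["counts"], round(relative_length(stats, lang), 2))
```

Passing a real subword tokenizer as the `tokenize` argument is what would make the per-language distributions comparable to the right panel of Figure 4; the regex splitter only keeps the example dependency-free.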