Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection

Marie Roald, Magnus Breder Birkenes, Lars Gunnarsønn Bagøien Johnsen

TL;DR

This work presents a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books, comparing Vision Transformer, Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification.

Abstract

Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.
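As a rough illustration of embedding-based image retrieval in general, the sketch below ranks stored image embeddings by their similarity to a query embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which similarity measure or index the application uses, so cosine similarity and a brute-force scan are assumed here, and the three-dimensional vectors and image identifiers are made up for the example. In practice the vectors would come from a ViT, CLIP, or SigLIP encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, top_k=3):
    """Return the top_k (image_id, score) pairs, most similar first.

    `index` maps a hypothetical image identifier to its embedding vector.
    """
    scored = [(image_id, cosine_similarity(query, vec)) for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, for illustration only).
toy_index = {
    "cat_engraving": [0.9, 0.1, 0.0],
    "map_of_norway": [0.0, 0.2, 0.9],
    "cat_sketch": [0.8, 0.3, 0.1],
}

# The query vector stands in for an embedded text query or uploaded image.
results = retrieve([1.0, 0.0, 0.0], toy_index, top_k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

With joint text-image models such as CLIP and SigLIP, a text query and an image query can both be embedded into the same space, so the same ranking step could serve both search modes shown in Figure 1.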

Paper Structure

This paper contains 10 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Screenshots of the image search application: context-based search for "kat" (old Norwegian for cat) (a) and image-based query with a user-uploaded cat image (c). (b) and (d) show the results when selecting an image in (a) and (c), respectively. The app also has a collapsible sidebar (not shown) that we used for selecting SigLIP embedding vectors.
  • Figure 2: The class distribution for the manually labelled training set and estimated class distribution for the full dataset. (a) shows absolute counts, and (b) shows label distributions for the training set (inner) and estimated distributions for the full dataset (outer).