Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP

Jamie Mahowald; Benjamin Charles Germain Lee

TL;DR

This paper explores the potential for interactively searching large-scale map collections with natural language, visual, and multimodal inputs, using the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model.

Abstract

Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs ("maps with sea monsters"), visual inputs (i.e., reverse image search), and multimodal inputs (an example map + "more grayscale"). As a case study, we adopt 562,842 images of maps publicly accessible via the Library of Congress's API. To accomplish this, we use the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model to generate embeddings for these maps, and we develop code to implement exploratory search capabilities with these input strategies. We present results for example searches created in consultation with staff in the Library of Congress's Geography and Map Division and describe the strengths, weaknesses, and possibilities for these search queries. Moreover, we introduce a fine-tuning dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning a CLIP model on this dataset. To facilitate re-use, we provide all of our code in documented, interactive Jupyter notebooks and place all code into the public domain. Lastly, we discuss the opportunities and challenges for applying these approaches across both digitized and born-digital collections held by galleries, libraries, archives, and museums.

Paper Structure (21 sections, 2 equations, 7 figures)

Introduction
Related Work
Methodology 1: Generating CLIP Embeddings for 562,842 Images of Maps
Our Dataset of Library of Congress Maps
Generating Embeddings
A Dataset & Architecture for Fine-tuning
Methodology 2: Implementing Search & Discovery with CLIP Embeddings
Text-input Search
Image-input Search
Text- & Image-input Search
Results & Discussion
Search Results
Text-input Search
Image-input Search
Text- & Image-input Search
...and 6 more sections

Figures (7)

Figure 1: An overview of our search implementation for 562,842 images of maps held by the Library of Congress. Users can create three different types of inputs: 1) text, 2) image, and 3) text & image. The specified search input is used to retrieve the most relevant maps by dynamically computing a CLIP embedding for the search input and comparing it to the pre-computed CLIP embeddings of all 562,842 images (more information on the beto architecture used for this step can be found in Section \ref{sec:methodology}). The top $k$ search results are then returned to the user.
Figure 2: An overview of our embeddings generation pipeline along with the resulting metadata dictionaries, as described in this section. Each box represents a function; each function's left column lists objects created, while the right column lists the method of creation. Arrows between boxes represent items piped through successive functions.
Figure 3: A natural language caption is extracted from the loc.gov metadata through the get_caption function without the need for a language model. The function is designed to avoid redundancy: in this instance, it skips over the entries in "location" that include terms already extracted from the title. Metadata notes tend to be formatted such that only the first several items contain visual feature information, while the last two notes contain catalogue and availability data less pertinent to feature extraction.
Figure 4: Three examples of the map-caption pairs from our resulting dataset.
Figure 5: The most relevant images returned by our search system for the following text queries: "tattered and worn map," "old panoramic map surrounded by images of buildings," and "a map with illustrations of 19th-century ships." Non-normalized similarity scores are shown for each returned image.
...and 2 more figures
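
The retrieval step described in the Figure 1 caption (embed the search input, compare it against the pre-computed embeddings, return the top $k$) can be illustrated with a minimal sketch. This is not the authors' code: random vectors stand in for CLIP embeddings, the dimension 512 and the names `normalize`, `top_k`, and `combine` are hypothetical, cosine similarity is assumed as the comparison (the paper reports non-normalized scores), and the weighted-sum rule for text & image inputs is an assumption, since this excerpt does not specify how the two inputs are merged.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def top_k(query_embedding, collection_embeddings, k=5):
    """Return (index, cosine similarity) pairs for the k closest items."""
    q = normalize(query_embedding)
    c = normalize(collection_embeddings)
    scores = c @ q                      # one similarity score per item
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]


def combine(image_embedding, text_embedding, text_weight=0.5):
    """ASSUMPTION: merge an image query and a text query by a weighted
    sum of their unit vectors. The paper's actual rule may differ."""
    mixed = ((1.0 - text_weight) * normalize(image_embedding)
             + text_weight * normalize(text_embedding))
    return normalize(mixed)


# Toy data: random vectors stand in for pre-computed CLIP embeddings.
rng = np.random.default_rng(0)
collection = rng.normal(size=(1000, 512))

# A query that is a slightly perturbed copy of item 42.
query = collection[42] + 0.01 * rng.normal(size=512)
results = top_k(query, collection, k=3)

# A multimodal query built from two stand-in embeddings.
multimodal_query = combine(collection[42], rng.normal(size=512))
```

In this toy setup, `results` lists item 42 first because the query is a near-copy of it. With real embeddings, normalizing the collection matrix once ahead of time would avoid repeating that work on every query.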
