GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Kyle Deeds, Ying-Hsiang Huang, Claire Gong, Shreya Shaji, Alison Yan, Leslie Harka, Samuel J Klein, Shannon Zejiang Shen, Mark Phillips, Trevor Owens, Benjamin Charles Germain Lee

TL;DR

GovScape presents a publicly deployed multimodal search system for 10,015,993 federal PDFs (70,958,487 pages) from the 2020 End of Term crawl, addressing deep access and discoverability gaps in large-scale web archives. It supports semantic text search, visual search over per-page imagery, and exact keyword search, all combinable with metadata filters, by leveraging a scalable embedding pipeline (text via BAAI/bge-base-en-v1.5, images via CLIP) and Faiss-based nearest-neighbor search, plus SQLite FTS5 for exact text queries. The pre-processing pipeline processes the dataset at an estimated cost of about $1,500, achieving roughly 47,000 pages per dollar, and the authors release open-source code to enable replication and extension to 100+ million PDFs. The work demonstrates a practical, scalable approach to multimodal search in massive public document collections and outlines concrete future enhancements, including OCR for non-text PDFs, multilingual support, and expansion to mixed file types, with implications for journalism, research, and public accountability.
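To make the exact keyword search component concrete, the following is a minimal sketch of a phrase query against SQLite FTS5 combined with a metadata filter. It is not the GovScape implementation: the table name `pages`, its column layout, the helper names `build_index` and `keyword_search`, and the sample domains are all hypothetical, and the sketch assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3


def build_index(rows):
    """Build an in-memory FTS5 index (hypothetical schema, not GovScape's).

    Each row is (text, domain, crawl_date, pdf_id, page_num). Only `text`
    is tokenized; the metadata columns are stored but UNINDEXED.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5("
        "text, domain UNINDEXED, crawl_date UNINDEXED, "
        "pdf_id UNINDEXED, page_num UNINDEXED)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def keyword_search(conn, phrase, domain=None, limit=10):
    """Exact phrase search over page text, optionally filtered by domain."""
    # Double quotes turn the input into an FTS5 phrase query; embedded
    # quotes are escaped by doubling them.
    match = '"' + phrase.replace('"', '""') + '"'
    sql = "SELECT pdf_id, page_num, domain FROM pages WHERE pages MATCH ?"
    params = [match]
    if domain is not None:
        sql += " AND domain = ?"
        params.append(domain)
    # `rank` is FTS5's built-in relevance ordering (BM25 by default).
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = build_index([
        ("Annual report with pie charts of spending", "example.gov", "2020-11-01", "a", 1),
        ("Budget pie charts by agency", "other.gov", "2020-12-01", "c", 2),
    ])
    print(keyword_search(conn, "pie charts", domain="other.gov"))
```

Keeping the metadata columns in the same virtual table lets a single SQL statement apply both the text match and the filter; a production system at this scale would likely need a different layout.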

Abstract

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages), which are, to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open-source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.
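The pages-per-dollar figure can be checked directly from the counts in the abstract. The short sketch below reproduces that arithmetic and adds an illustrative extrapolation to 100 million PDFs; the extrapolation assumes cost scales linearly and that the average page count per PDF stays the same, neither of which is claimed in the paper. The function names are ours.

```python
# Figures quoted in the abstract.
PDF_COUNT = 10_015_993
PAGE_COUNT = 70_958_487
COMPUTE_COST_USD = 1_500  # approximate, as reported


def pages_per_dollar(pages: int, cost_usd: float) -> float:
    """Pages processed per dollar of compute."""
    return pages / cost_usd


def projected_cost_usd(target_pdfs: int) -> float:
    """Illustrative projection only.

    Assumption (not from the paper): cost is linear in page count and the
    average number of pages per PDF matches the 2020 collection.
    """
    avg_pages_per_pdf = PAGE_COUNT / PDF_COUNT
    target_pages = target_pdfs * avg_pages_per_pdf
    return target_pages / pages_per_dollar(PAGE_COUNT, COMPUTE_COST_USD)


if __name__ == "__main__":
    # Roughly 47,300 pages per dollar, consistent with the reported ~47,000.
    print(f"{pages_per_dollar(PAGE_COUNT, COMPUTE_COST_USD):,.0f} pages per dollar")
    print(f"${projected_cost_usd(100_000_000):,.0f} projected for 100M PDFs")
```

Under those assumptions the projection lands near $15,000, but longer PDFs, OCR for non-text documents, or different hardware pricing would all change it.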

Paper Structure

This paper contains 26 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: An overview of GovScape. Our public search system supports three types of search over 10,015,993 government PDFs (70,958,487 PDF pages): 1) semantic text search over PDF text, 2) visual search over individual PDF pages (treated as images), and 3) keyword search over PDF text, all of which can be applied in conjunction with filter conditions against metadata, including domain and crawl date.
  • Figure 2: An overview of the GovScape pre-processing pipeline, showing how a single PDF in GovScape is parsed and semantified.
  • Figure 3: Examples of semantic text search and visual search in GovScape.
  • Figure 4: An overview of the GovScape architecture, showing how the constituent parts of the system interact with one another.
  • Figure 5: A screenshot showing the selected PDF view for detailed document inspection (in this case, the fourth page of a redacted FCC document). Clicking on the first search result brings up this view, which shows the selected PDF page in the modal, its associated metadata (domain, crawl date, and crawl URL), a button to download the PDF, a button to share a link to the PDF, and thumbnail views of the other PDF pages.