Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version)

Shrey Mishra, Antoine Gauquier, Pierre Senellart

TL;DR

This work proposes a modular sequential multimodal machine learning approach for extracting theorem-like environments and proofs. A cross-modal attention mechanism generates multimodal paragraph embeddings, which are fed into a novel multimodal sliding window transformer architecture to capture sequential information across paragraphs.

Abstract

We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose a modular sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. This is based on a cross-modal attention mechanism to generate multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our document AI methodology stands out as it eliminates the need for OCR preprocessing, LaTeX sources during inference, or custom pre-training on specialized losses to understand cross-modality relationships. Unlike many conventional approaches that operate at a single-page level, ours can be directly applied to multi-page PDFs and seamlessly handles the page breaks often found in lengthy scientific mathematical documents. Our approach demonstrates performance improvements obtained by transitioning from unimodality to multimodality, and finally by incorporating sequential modeling over paragraphs.
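
To make the sliding-window idea concrete, the following is a minimal sketch of how a document-level sequence of paragraphs could be cut into overlapping windows before being passed to a sequence model. It is not the authors' implementation: the function name `sliding_windows`, the `window` and `stride` parameters, and the toy page data are all assumptions made for illustration. The point it shows is that when paragraphs from every page are flattened into one sequence first, a window can span a page break without any special handling.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows(
    paragraphs: Sequence[T], window: int, stride: int = 1
) -> List[Tuple[int, List[T]]]:
    """Return (start_index, items) pairs that together cover the sequence.

    Hypothetical helper: `window` is the number of paragraphs per window and
    `stride` is the step between window starts. Requiring stride <= window
    guarantees that no paragraph is skipped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or paragraphs are skipped")

    n = len(paragraphs)
    if n == 0:
        return []

    windows: List[Tuple[int, List[T]]] = []
    start = 0
    while True:
        # The last window may be shorter than `window` items.
        windows.append((start, list(paragraphs[start:start + window])))
        if start + window >= n:
            break
        start += stride
    return windows


if __name__ == "__main__":
    # Toy data (assumed): paragraphs grouped by page, then flattened so that
    # windows ignore page boundaries.
    pages = [["p0", "p1", "p2"], ["p3", "p4"], ["p5", "p6"]]
    flat = [paragraph for page in pages for paragraph in page]
    for start, items in sliding_windows(flat, window=3, stride=2):
        print(start, items)
```

In a real pipeline each item would be a multimodal paragraph embedding rather than a string, and overlapping windows would yield several predictions per paragraph that need to be reconciled; how that is done is left open here.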

Paper Structure (17 sections, 17 figures, 20 tables)

Figures (17)

  • Figure 1: Overall model inference pipeline
  • Figure 2: Vocabulary overlap among popular language models (BERT, DistilBERT, SciBERT, in cased or uncased variants) and our pretrained model (labeled as trained_tokenizer here)
  • Figure 3: MLM loss for two pretrained models, as a function of the pretraining epoch
  • Figure 4: Visualising the attention maps of a finetuned transformer language model
  • Figure 5: Grad-CAM visualizations of some sample blocks
  • ...and 12 more figures