Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version)

Shrey Mishra; Antoine Gauquier; Pierre Senellart

TL;DR

This work proposes a modular, sequential, multimodal machine learning approach for extracting theorem-like environments and proofs. A cross-modal attention mechanism generates multimodal paragraph embeddings, which are fed into a novel multimodal sliding window transformer architecture that captures sequential information across paragraphs.

Abstract

We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose a modular sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. This is based on a cross-modal attention mechanism to generate multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our document AI methodology stands out as it eliminates the need for OCR preprocessing, LaTeX sources during inference, or custom pre-training on specialized losses to understand cross-modality relationships. Unlike many conventional approaches that operate at a single-page level, ours can be directly applied to multi-page PDFs and seamlessly handles the page breaks often found in lengthy scientific mathematical documents. Our approach demonstrates performance improvements obtained by transitioning from unimodality to multimodality, and finally by incorporating sequential modeling over paragraphs.
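The two stages named in the abstract can be illustrated with a small sketch. It is not the authors' architecture: the real model uses trained transformer layers, whereas this uses plain, unparameterized scaled dot-product attention on toy vectors. The function names (`cross_modal_attention`, `fuse_paragraph`, `sliding_windows`), the shared vector dimension, and the window and stride values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, others):
    """One modality vector attends over the other modality vectors.

    Assumption: all vectors share the same dimension and no learned
    projections are applied (a trained model would have them).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, other)) / scale for other in others]
    weights = softmax(scores)
    return [
        sum(w * other[i] for w, other in zip(weights, others))
        for i in range(len(query))
    ]


def fuse_paragraph(text_vec, font_vec, vision_vec):
    """Concatenate each modality's attended view of the two others."""
    modalities = [text_vec, font_vec, vision_vec]
    fused = []
    for i, query in enumerate(modalities):
        others = modalities[:i] + modalities[i + 1:]
        fused.extend(cross_modal_attention(query, others))
    return fused


def sliding_windows(items, window, stride):
    """Overlapping windows over paragraphs in document order.

    Windows are built over the whole paragraph sequence, so a window
    may span a page break. A final window is added if the stride would
    otherwise leave the last paragraphs uncovered.
    """
    if len(items) <= window:
        return [items[:]]
    starts = list(range(0, len(items) - window + 1, stride))
    if starts[-1] + window < len(items):
        starts.append(len(items) - window)
    return [items[s:s + window] for s in starts]


# Hypothetical toy input: 7 paragraphs, each with 2-dimensional modality vectors.
paragraphs = [fuse_paragraph([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]) for _ in range(7)]
windows = sliding_windows(paragraphs, window=4, stride=2)
```

In this sketch each window would then be passed to a sequence classifier that labels every paragraph it contains; how overlapping predictions are combined is left open here.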

Paper Structure (17 sections, 17 figures, 20 tables)

Introduction
Related Work
Unimodal Models
Text Modality
Text Modality
Vision Modality
Vision Modality
Font Modality
Font Modality
Multimodal and Sequential Models
Dataset and Setup
Experimental Results
Experimental Results
Hyperparameter tuning of SW Transformer
Interpretability
...and 2 more sections

Figures (17)

Figure 1: Overall model inference pipeline
Figure 2: Vocabulary overlap among popular language models (BERT, DistilBERT, SciBERT, in cased or uncased variants) and our pretrained model (labeled as trained_tokenizer here)
Figure 3: MLM loss for two pretrained models, as a function of the pretraining epoch
Figure 4: Visualising the attention maps of a finetuned transformer language model
Figure 5: Grad-CAM visualizations of some sample blocks
...and 12 more figures
