Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System
Feng Jiang, Kuang Wang, Haizhou Li
TL;DR
This paper introduces MMAPIS, an open-source, multi-modal framework for automated scientific paper interpretation. It processes PDFs into structured text, figures, and tables, aligns the multimodal content by section, and applies hierarchical, section-aware summarization to produce cohesive document interpretations. It combines Nougat-based text and formula extraction with PDFFigures 2.0 for visual content, so that coherent section-level summaries can be integrated into a document-level narrative. The system also offers four downstream multimodal interfaces (paper recommendations, multimodal Q&A, audio broadcasting, and interpretation blogs), enabled by carefully designed prompts that use both Chain-of-Thought and density-based prompting; APIs support downstream customization. Quantitative evaluation on arXiv-derived datasets shows that MMAPIS outperforms GPT-4 on key informativeness and overall metrics, underscoring the value of preserving multimodal structure and section-level discourse in long scientific texts. The work advances accessible, efficient engagement with scientific literature and paves the way for real-time, user-tailored interpretations across multiple formats.
Abstract
In the contemporary information era, accelerated by the advent of large language models (LLMs), the volume of scientific literature is reaching unprecedented levels. Researchers urgently need efficient tools for reading and summarizing academic papers, discovering significant scientific literature, and interpreting papers in diverse ways. Automated scientific literature interpretation systems have therefore become essential. However, prevailing systems, both commercial and open-source, face notable challenges: they often overlook multimodal data, struggle to summarize overly long texts, and lack diverse user interfaces. In response, we introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with a three-stage pipeline that incorporates LLMs to augment its functionality. The system first employs a hybrid modality preprocessing and alignment module to extract plain text and tables or figures from documents separately. It then aligns this information by section name, so that content with the same section name is grouped under the same section. Next, we introduce a hierarchical discourse-aware summarization method. It uses the extracted section names to divide the article into shorter text segments, enabling targeted summarization both within and across sections via LLMs with specific prompts. Finally, we design four types of user interfaces (paper recommendation, multimodal Q&A, audio broadcasting, and interpretation blog) that can be applied across a wide range of scenarios. Our qualitative and quantitative evaluations underscore the system's superiority, especially in scientific summarization, where it outperforms solutions relying solely on GPT-4.
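The alignment and hierarchical summarization steps described in the abstract can be illustrated with a minimal sketch. It is not the MMAPIS implementation: the function names, the data layout, and the `stub_summarize` placeholder (which stands in for a prompted LLM call) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: each section name maps to its text chunks and visual references.
Aligned = Dict[str, Dict[str, List[str]]]


def align_by_section(
    text_sections: List[Tuple[str, str]],
    visuals: List[Tuple[str, str]],
) -> Aligned:
    """Group text and figure/table references that share a section name."""
    aligned: Aligned = {}
    for name, body in text_sections:
        aligned.setdefault(name, {"text": [], "visuals": []})["text"].append(body)
    for name, ref in visuals:
        aligned.setdefault(name, {"text": [], "visuals": []})["visuals"].append(ref)
    return aligned


def hierarchical_summary(
    aligned: Aligned,
    summarize: Callable[[str, str], str],
) -> str:
    """Summarize each section first, then merge the section summaries."""
    section_summaries = [
        f"{name}: {summarize('section', ' '.join(parts['text']))}"
        for name, parts in aligned.items()
    ]
    return summarize("document", "\n".join(section_summaries))


def stub_summarize(level: str, text: str) -> str:
    """Placeholder for an LLM call (assumption): truncate sections, pass documents through."""
    if level == "section":
        return " ".join(text.split()[:8])
    return text


# Example input, invented for illustration only.
text_sections = [
    ("Introduction", "Scientific literature is growing quickly and readers need help."),
    ("Method", "We align text with figures by section name."),
]
visuals = [("Method", "Figure 1")]

aligned = align_by_section(text_sections, visuals)
result = hierarchical_summary(aligned, stub_summarize)
print(result)
```

In a real pipeline, `stub_summarize` would be replaced by a model call with level-specific prompts, and the visual references collected per section could be attached to the corresponding section summary.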
