NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

Aniket Pal, Sanket Biswas, Alloy Das, Ayush Lodh, Priyanka Banerjee, Soumitri Chattopadhyay, Dimosthenis Karatzas, Josep Llados, C. V. Jawahar

TL;DR

NoTeS-Bank introduces a challenging benchmark for neural transcription and search over unstructured, handwritten scientific notes, covering equations, diagrams, and notation with two tasks: Evidence-Based VQA ($A, E = f_{EB\text{-}QA}(I, Q)$, with $E = \{(B_i, L_i, G_i)\}$) and Open-Domain QA ($C = f_{domain}(Q)$, $I = f_{retrieve}(D, Q, C)$, $A = f_{answer}(I, Q)$). It combines diverse domains with meticulous annotations that provide ground-truth bounding boxes and domain labels, and it is evaluated with ANLS, IoU, Hit@K, MRR, and NDCG@5, enabling a fine-grained diagnosis of multimodal reasoning and retrieval. A broad set of baselines (Vision-Language Models, OCR+LLM pipelines, and retrieval-augmented approaches) is compared, revealing that current systems still lag human performance, especially in grounding evidence and handling complex symbolic content. The results highlight modality gaps in handwritten document understanding and point toward future directions such as spatially-aware tokenization, better handling of handwritten mathematical and chemical content, and hierarchical retrieval to improve both grounding and open-domain retrieval. Overall, NoTeS-Bank provides a rigorous, multimodal benchmark to spur advances in grounded reasoning over real-world handwritten scientific notes, with potential impact on education, archival, and scientific note-taking tools.
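
The two metrics most closely tied to Evidence-Based VQA, ANLS for the answer $A$ and IoU for a predicted evidence box $B_i$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's official scorer: it uses the commonly cited ANLS definition with a 0.5 threshold, lowercases and strips answers before comparison, and represents boxes as `(x_min, y_min, x_max, y_max)`. The function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several acceptable reference answers; the best
    match counts. Similarities at or beyond the threshold distance score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # normalisation is an assumption
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7, since the boxes share one unit of area out of seven covered in total, and `anls(["f = ma"], [["f = mb"]])` gives 5/6 because one of six characters differs.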

Abstract

Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
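
The retrieval side of the evaluation (Hit@K, MRR, NDCG@5) can be illustrated with a short sketch. It assumes binary relevance, with each query paired with a set of relevant page identifiers. That is an assumption made for illustration, not a detail confirmed by the abstract, and the function names and page identifiers are hypothetical. When every query has exactly one relevant page, Hit@K and Recall@K coincide.

```python
import math
from typing import Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """NDCG@k with binary gains (1 for a relevant item, 0 otherwise)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    # Ideal ranking places all relevant items (up to k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query: the relevant page "p1" is ranked second.
example_ranked = ["p3", "p1", "p7"]
example_relevant = {"p1"}
```

With the example above, the first relevant page sits at rank 2, so Hit@1 is 0, Hit@2 is 1, the reciprocal rank is 0.5, and NDCG@5 is $1/\log_2 3 \approx 0.63$.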

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of OCRs on a challenging handwritten scientific note sample. Powerful commercial OCR engines (e.g., Textract, Google-OCR) fail to accurately transcribe the content, often losing mathematical symbols, structure, and semantic meaning. In contrast, open-source OCRs (e.g., Nougat, Got 2.0, OLM-OCR) struggle to extract even the textual content for the full document. This highlights the necessity for multimodal reasoning beyond OCR in handwritten document understanding.
  • Figure 2: Illustrative examples from the NoTeS-Bank showing diverse local reasoning categories such as equations, flowcharts, structural formulas, and textual answers. Center: Distribution of question types across eight task-level categories. Surrounding: Sample questions and annotated evidence regions from handwritten notes, highlighting the visual and semantic complexity handled in the Evidence-Based VQA task.
  • Figure 3: (Top) Global category distribution across 19 scientific and technical domains in the NoTeS-Bank benchmark, including physics, biology, chemistry, and computer science. This diverse coverage enables a fine-grained evaluation of domain-specific reasoning. (Bottom) ANLS performance comparison on the Evidence-Based VQA task. Results are shown for human annotators, open/closed Vision-Language Models (VLMs), and OCR+LLM pipelines. The large performance gap highlights the challenge of accurately answering and grounding questions in visually unstructured, handwritten academic notes.
  • Figure 4: Qualitative comparison of Vision-Language Models (VLMs), OCR+LLMs, and human responses on the NoTeS-Bank Evidence-Based VQA task. Each example demonstrates the challenge of retrieving grounded answers from handwritten scientific notes, highlighting the limitations of current models in detecting accurate regions and reasoning over domain-specific content. The figure also illustrates the fine-grained local (e.g., Structural Formula, Flowchart) and global (e.g., Organic Chemistry, Reproduction in Organisms) category annotations provided for each question-answer pair.
  • Figure 5: We report the average ANLS for the human expert vs. the best-performing model per diagnostic category as a ceiling analysis.
  • ...and 1 more figure