NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding
Aniket Pal, Sanket Biswas, Alloy Das, Ayush Lodh, Priyanka Banerjee, Soumitri Chattopadhyay, Dimosthenis Karatzas, Josep Llados, C. V. Jawahar
TL;DR
NoTeS-Bank introduces a challenging benchmark for neural transcription and search over unstructured, handwritten scientific notes, addressing equations, diagrams, and notation with two tasks: Evidence-Based VQA ($A,E = f_{EB\text{-}QA}(I,Q)$, $E=\{(B_i,L_i,G_i)\}$) and Open-Domain QA ($C=f_{domain}(Q)$, $I=f_{retrieve}(D,Q,C)$, $A=f_{answer}(I,Q)$). It combines diverse domains and meticulous annotations to enable ground-truth bounding boxes and domain labels, evaluated with metrics including $ANLS$, $IoU$, $Hit@K$, $MRR$, and $NDCG@5$, enabling a fine-grained diagnostic of multimodal reasoning and retrieval. A broad set of baselines (Vision-Language Models, OCR+LLM pipelines, and retrieval-augmented approaches) is compared, revealing that current systems still lag human performance, especially in grounding evidence and handling complex symbolic content. The results highlight modality gaps in handwritten document understanding and point toward future directions such as spatially-aware tokenization, better handling of handwritten mathematical and chemical content, and hierarchical retrieval to improve both grounding and open-domain retrieval. Overall, NoTeS-Bank provides a rigorous, multimodal benchmark to spur advances in grounded reasoning over real-world handwritten scientific notes, with potential impact on education, archival, and scientific note-taking tools.
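The Open-Domain QA formulation above is a composition of three stages, which can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `Note` structure, the `open_domain_qa` wrapper, and the keyword-based `toy_*` stand-ins are all hypothetical, and a real system would operate on note images rather than text strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Note:
    """One note page in the corpus D (hypothetical structure)."""
    note_id: str
    domain: str
    text: str  # stand-in for page content; a real system would use the image


def open_domain_qa(
    question: str,
    corpus: Sequence[Note],
    f_domain: Callable[[str], str],
    f_retrieve: Callable[[Sequence[Note], str, str], List[Note]],
    f_answer: Callable[[List[Note], str], str],
) -> str:
    """Compose C = f_domain(Q), I = f_retrieve(D, Q, C), A = f_answer(I, Q)."""
    c = f_domain(question)                       # predicted domain C
    retrieved = f_retrieve(corpus, question, c)  # retrieved notes I
    return f_answer(retrieved, question)         # answer A


# Toy stand-ins (assumptions for illustration only).
def toy_domain(question: str) -> str:
    q = question.lower()
    return "chemistry" if "reaction" in q or "molecule" in q else "math"


def toy_retrieve(corpus: Sequence[Note], question: str, domain: str, k: int = 1) -> List[Note]:
    # Rank notes of the predicted domain by word overlap with the question.
    q_words = set(question.lower().split())
    candidates = [n for n in corpus if n.domain == domain]
    ranked = sorted(
        candidates,
        key=lambda n: len(q_words & set(n.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_answer(notes: List[Note], question: str) -> str:
    # Return the top note's content verbatim; a real model would generate an answer.
    return notes[0].text if notes else ""


corpus = [
    Note("m1", "math", "derivative of sin x is cos x"),
    Note("m2", "math", "integral of x is x squared over two"),
    Note("c1", "chemistry", "reaction of sodium with water releases hydrogen"),
]
answer = open_domain_qa(
    "What is the derivative of sin x?", corpus, toy_domain, toy_retrieve, toy_answer
)
```

Passing the stages as callables keeps the composition explicit, so an error in domain classification can be seen to propagate into retrieval and then into the answer.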
Abstract
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
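For readers unfamiliar with the metrics named above, the following is a minimal sketch of how they are commonly computed. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), binary relevance for the retrieval metrics, and the conventional ANLS threshold of 0.5; the benchmark's exact settings may differ, and all function names here are illustrative.

```python
import math
from typing import Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s: str, t: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def anls(pred: str, golds: Sequence[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best (1 - NL) over gold answers, zeroed when NL >= tau."""
    p = pred.strip().lower()
    best = 0.0
    for gold in golds:
        g = gold.strip().lower()
        denom = max(len(p), len(g))
        nl = levenshtein(p, g) / denom if denom else 0.0
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document; MRR is the mean over queries."""
    for rank, d in enumerate(ranked, 1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """NDCG@k with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 1) for i, d in enumerate(ranked[:k], 1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

Each function scores a single question or query; dataset-level numbers are obtained by averaging these per-item scores.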
