Automatic Recognition of Learning Resource Category in a Digital Library

Soumya Banerjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, Partha Pratim Das

TL;DR

The paper tackles automatic recognition of learning resource categories in digital libraries to guide metadata extraction. It introduces the HLR dataset and a two-branch classifier, trained on HLR, that fuses image features from a VGG16 backbone with OCR-derived text processed via GloVe embeddings and a bi-LSTM with self-attention. The single-page classifier achieves $94.15\%$ accuracy and the image-only variant reaches $92.1\%$; multi-page document classification reaches $95\%$ accuracy after targeted meta-knowledge corrections. The dataset and code are released to support further research. The work targets scalable, multilingual metadata extraction across diverse document types in large digital libraries, addressing a gap in handling heterogeneous, multi-page resources.
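
The two-branch idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the dot-product scoring against a `query` vector, the fusion by plain concatenation, and all function names are assumptions made for illustration. In the paper's setting, the text vectors would come from a bi-LSTM over GloVe embeddings and the image vector from a VGG16 backbone.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool a sequence of vectors into one vector using attention weights.

    hidden_states: list of equal-length float lists (e.g. per-token states).
    query: float list of the same length (a hypothetical learned vector).
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]


def fuse(image_features, text_features):
    """Assumed fusion step: concatenate the two branch outputs."""
    return list(image_features) + list(text_features)
```

A downstream classifier layer would then take the fused vector as input; the actual fusion and classification layers used in the paper may differ.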

Abstract

Digital libraries often face the challenge of processing a large volume of diverse document types. The manual collection and tagging of metadata can be a time-consuming and error-prone task. To address this, we aim to develop an automatic metadata extractor for digital libraries. In this work, we introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification. The approach involves decomposing individual learning resources into constituent document images (sheets). These images are then processed through an OCR tool to extract textual representation. State-of-the-art classifiers are employed to classify both the document image and its textual content. Subsequently, the labels of the constituent document images are utilized to predict the label of the overall document.
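
The last step, deriving a document label from its page labels, can be sketched as follows. The abstract does not state the aggregation rule, so the majority vote here is an assumption, as are the function name, the tie-breaking rule, and the category names in the usage example.

```python
from collections import Counter


def predict_document_label(page_labels):
    """Return the most frequent page label (assumed majority-vote rule).

    Ties are broken in favour of the label that appears first in the
    document, which is an illustrative choice rather than the paper's.
    """
    if not page_labels:
        raise ValueError("a document needs at least one page label")
    counts = Counter(page_labels)
    best = max(counts.values())
    for label in page_labels:
        if counts[label] == best:
            return label


# Hypothetical category names, for illustration only.
pages = ["thesis", "thesis", "slide", "thesis"]
print(predict_document_label(pages))
```

A plain vote ignores page order and classifier confidence; the meta-knowledge corrections mentioned in the TL;DR suggest the authors refine page-level predictions beyond such a simple rule.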

Paper Structure (5 sections, 3 figures)


Figures (3)

  • Figure 1: HLR Dataset Sample
  • Figure 2: Confusion matrix for single page classification tasks.
  • Figure 3: Confusion matrices for multi-page classification tasks.