MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem

TL;DR

MeXtract addresses accurate metadata extraction from long scientific texts with a family of lightweight LLMs (0.5B–3B parameters) fine-tuned from Qwen 2.5 using LoRA and further refined with direct preference optimization. The authors extend the MOLE benchmark to MOLE+ by adding model papers, which enables evaluation on unseen schemas and cross-domain metadata. They also collect and annotate 1,889 papers and distill knowledge from Kimi-K2 for training. The models achieve state-of-the-art results among similarly sized models and transfer to unseen model schemas, and all code, data, and models are released openly. The work offers an efficient, schema-guided approach to metadata extraction for large-scale scientific corpora and a practical foundation for indexing and search across domains.

Abstract

Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning their Qwen 2.5 counterparts. Within their size class, MeXtract models achieve state-of-the-art metadata extraction performance on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing a challenging out-of-domain subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.

Paper Structure

This paper contains 20 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Schema-based metadata extraction using MeXtract. The model takes three inputs (the paper text, the schema, and the guidelines) and produces one output, the metadata JSON. This example is for illustration only; the model schema contains 16 attributes.
  • Figure 2: Annotated data collection pipeline for instruction tuning. In each stage, we show the number of papers.
  • Figure 3: Results of all models averaged by year. The results show the F1 score for the years 2023, 2024, and 2025, respectively.
  • Figure 4: Per-attribute results (9 attributes) for all models on the MOLE benchmark.
  • Figure 5: System prompt used to extract resource papers. We use Gemma 3 27B to label datasets into 8 categories.
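
The input/output contract described for Figure 1 (paper text, schema, and guidelines in; metadata JSON out) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the schema format, the three attributes, the guideline text, and the helper names `build_prompt` and `validate_metadata` are all hypothetical, and the actual model call is left out.

```python
import json

# Hypothetical three-attribute schema. The format (type plus optional options)
# is an assumption for illustration, not the schema used in the paper.
SCHEMA = {
    "Name": {"type": "str"},
    "Year": {"type": "int"},
    "License": {"type": "str", "options": ["MIT", "Apache-2.0", "unknown"]},
}

# Hypothetical guideline text.
GUIDELINES = (
    "Answer with a single JSON object. "
    "Use 'unknown' when the paper does not state a value."
)

_PY_TYPES = {"str": str, "int": int, "list": list}


def build_prompt(paper_text, schema, guidelines):
    """Combine the three inputs into one prompt string."""
    return (
        f"Guidelines:\n{guidelines}\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Paper:\n{paper_text}\n\n"
        "Metadata JSON:"
    )


def validate_metadata(raw_output, schema):
    """Parse the model output and list every way it deviates from the schema."""
    metadata = json.loads(raw_output)
    if not isinstance(metadata, dict):
        raise ValueError("expected a JSON object")
    errors = []
    for key, spec in schema.items():
        if key not in metadata:
            errors.append(f"missing attribute: {key}")
            continue
        value = metadata[key]
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
        elif "options" in spec and value not in spec["options"]:
            errors.append(f"value for {key} not in allowed options")
    for key in metadata:
        if key not in schema:
            errors.append(f"unexpected attribute: {key}")
    return metadata, errors
```

In use, the string returned by `build_prompt` would be sent to an extraction model, and the text that comes back would be passed to `validate_metadata`. Returning a list of errors rather than raising on the first one makes it easier to see how far an output is from the schema.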