Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding
Balaji Muralidharan, Hayden Beadles, Reza Marzban, Kalyan Sashank Mupparaju
TL;DR
Knowledge AI develops a domain-specific fine-tuning framework for scientific NLP tasks, aiming to democratize access to scientific knowledge. By adapting LLMs to four core tasks (summarization, text generation, question answering, both extractive and abstractive, and named entity recognition), the study demonstrates significant performance gains over baselines, with clear trade-offs between full fine-tuning and parameter-efficient approaches such as LoRA. The approach leverages adaptive tokenization, Longformer extensions for long documents, and domain pretraining (e.g., SciBERT) to boost effectiveness in scientific contexts, using datasets from arXiv, PubMedQA, SQuAD, CoNLL2003, SciERC, and GENIA. Overall, Knowledge AI highlights the practical viability of fine-tuned LLMs for scientific knowledge extraction and dissemination, offering a foundation for accessible science communication and knowledge discovery, while noting efficiency and scalability considerations for real-world deployment.
Abstract
This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains, and creates a deep learning framework, Knowledge AI. As part of this framework, we employ pre-trained models and fine-tune them on datasets in the scientific domain. The models are adapted for four key Natural Language Processing (NLP) tasks: summarization, text generation, question answering, and named entity recognition. Our results indicate that domain-specific fine-tuning significantly enhances model performance in each of these tasks, thereby improving their applicability to scientific contexts. This adaptation enables non-experts to efficiently query and extract information within targeted scientific fields, demonstrating the potential of fine-tuned LLMs as a tool for knowledge discovery in the sciences.
