PLLaMa: An Open-source Large Language Model for Plant Science
Xianjun Yang, Junfeng Gao, Wenxin Xue, Erik Alexandersson
TL;DR
The paper addresses the lack of plant-science expertise in publicly available LLMs by building PLLaMa, an open-source extension of LLaMa-2 that is further trained on a large plant-science corpus and then refined via instruction tuning. It reports that domain-specific pretraining yields measurable gains, including an initial accuracy of 60% on a plant-knowledge quiz and expert-validated zero-shot responses. The checkpoints and code are openly available to support community-driven work in plant-science NLP, agriculture, and related fields. Overall, PLLaMa offers a practical, transparent resource for improving plant-science question answering and decision support with public models.
Abstract
Large Language Models (LLMs) have exhibited remarkable capabilities in understanding and interacting with natural language across various sectors. However, their effectiveness is limited in specialized areas requiring high accuracy, such as plant science, because they lack specific expertise in these fields. This paper introduces PLLaMa, an open-source language model that evolved from LLaMa-2. It is enhanced with a comprehensive database of more than 1.5 million scholarly articles in plant science, which gives PLLaMa extensive knowledge of and proficiency in plant and agricultural sciences. Our initial tests on datasets related to plants and agriculture show that PLLaMa has a substantially improved understanding of plant science-related topics. Moreover, we have formed an international panel of professionals, including plant scientists, agricultural engineers, and plant breeders. This panel plays a crucial role in verifying the accuracy of PLLaMa's responses to various academic inquiries, helping to ensure its effective and reliable application in the field. To support further research and development, we have made the model's checkpoints and source code accessible to the scientific community. These resources are available for download at https://github.com/Xianjun-Yang/PLLaMa.
