Towards Large Reasoning Models for Agriculture
Hossein Zaremehrjerdi, Shreyan Ganguly, Ashlyn Rairdin, Elizabeth Tranel, Benjamin Feuer, Juan Ignacio Di Salvo, Srikanth Panthulugiri, Hernan Torres Pacin, Victoria Moser, Sarah Jones, Joscif G Raigne, Yanben Shen, Heidi M. Dornath, Aditya Balu, Adarsh Krishnamurthy, Asheesh K Singh, Arti Singh, Baskar Ganapathysubramanian, Chinmay Hegde, Soumik Sarkar
TL;DR
This paper introduces AgThoughts, a large, expert-in-the-loop dataset of 44.6K agricultural question-answer pairs with reasoning traces, and AgReason, a 100-question open-ended benchmark for evaluating context-rich agronomic reasoning. Large reasoning models outperform standard LLMs on the benchmark, yet performance remains limited (the best baseline reaches about 36% accuracy). This gap motivates AgThinker, a family of small, domain-adapted models trained with LoRA on consumer hardware. The authors also propose an LLM-as-Judge evaluation framework and show that expert curation combined with reasoning traces can unlock domain-specific reasoning in LLMs, while leaving substantial room for improvement and for future work on multimodal inputs and broader geographic coverage. Overall, the work offers a concrete, expert-driven path toward agricultural decision-support AI built on specialized datasets, benchmarks, and lightweight models suited to real-world deployment.
Abstract
Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short on such nuanced problems because of limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle this kind of structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark for agricultural reasoning, comprising 100 questions. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional LLMs, though notable challenges persist: the strongest Gemini-based baseline achieves only 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: https://baskargroup.github.io/Ag_reasoning/
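
To make the reported accuracy figure concrete, the following is a minimal sketch of how per-question judge verdicts on a 100-question benchmark could be aggregated into a single score. It assumes the judge emits one binary pass/fail verdict per question; the judging rubric is not described in this summary, so that is an assumption. The names `JudgeVerdict` and `accuracy` are hypothetical, and the verdicts are synthetic placeholders chosen only to illustrate the arithmetic, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """One judge decision for one benchmark question (hypothetical schema)."""
    question_id: int
    passed: bool  # assumption: the judge returns a binary pass/fail


def accuracy(verdicts: list[JudgeVerdict]) -> float:
    """Return the fraction of questions the judge marked as passed."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.passed for v in verdicts) / len(verdicts)


# Synthetic placeholder data: 100 questions, of which 36 are marked as passed.
verdicts = [JudgeVerdict(question_id=i, passed=(i < 36)) for i in range(100)]
print(f"accuracy = {accuracy(verdicts):.0%}")
```

With only 100 questions, each verdict moves the score by one percentage point, so small differences between models should be read with that granularity in mind.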
