
Large Language Models: New Opportunities for Access to Science

Jutta Schnabel

TL;DR

The paper addresses how to improve access to KM3NeT's scientific data and publications by leveraging Large Language Models within an Open Science System. It introduces the LLMTuner package, a Python-based extension that builds on Retrieval Augmented Generation (RAG) and integrates with the AnythingLLM stack to manage document ingestion, embedding, workspace creation, and evaluation. A key contribution is the end-to-end LLMTuner framework, including InfoBasis-backed storage, provenance tracking, and a benchmarking approach for comparing LLMs and prompts across workspaces. The work demonstrates three KM3NeT applications—internal documentation retrieval, analysis-workflow assistance, and multilingual education—showing how LLM-driven tools can enhance findability, reproducibility, and science education within a large collaborative infrastructure.

Abstract

The adaptation of Large Language Models like ChatGPT for information retrieval from scientific data, software and publications offers new opportunities to simplify access to and understanding of science for people at all levels of expertise. They can become tools that both enhance the usability of the open science environment we are building and help to provide systematic insight into a long-built corpus of scientific publications. The uptake of Retrieval Augmented Generation-enhanced chat applications in the construction of the open science environment of the KM3NeT neutrino detectors serves as a focal point to explore and exemplify prospects for the wider application of Large Language Models for our science.
