S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and Document

Kareem Shaik, Dali Wang, Weijian Zheng, Qinglei Cao, Heng Fan, Peter Schwartz, Yunhe Feng

TL;DR

The paper addresses the difficulty of understanding large-scale scientific software, which typically combines multi-language codebases, very large code size, and sparse documentation. It introduces S3LLM, an open-source framework built on LLaMA-2 models that enables natural-language (NL) interaction with source code, code metadata, and technical documents, using a domain-specific language (Feature Query Language, FQL) and retrieval-augmented generation (RAG) via LangChain. The framework comprises three processing streams: source code analysis, metadata comprehension (DOT, SQL, and SPEL), and technical document interpretation. These are demonstrated through a case study on the E3SM model, with a focus on translating NL queries into precise DSL queries and retrieving relevant external information to augment responses. Key contributions include the NL-to-FQL translation pipeline, multi-format metadata handling, and RAG-based document querying, all implemented with configurable LLaMA-2 model sizes (7B, 13B, and 70B) to balance speed and accuracy, and released as open source for broad use in scientific computing.

Abstract

The understanding of large-scale scientific software poses significant challenges due to its diverse codebase, extensive code length, and target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examination of source code, code metadata, and summarized information in conjunction with textual technical reports in an interactive, conversational manner through a user-friendly interface. S3LLM leverages open-source LLaMA-2 models to enhance code analysis through the automatic transformation of natural language queries into domain-specific language (DSL) queries. Specifically, it translates these queries into Feature Query Language (FQL), enabling efficient scanning and parsing of entire code repositories. In addition, S3LLM is equipped to handle diverse metadata types, including DOT, SQL, and customized formats. Furthermore, S3LLM incorporates retrieval augmented generation (RAG) and LangChain technologies to directly query extensive documents. S3LLM demonstrates the potential of using locally deployed open-source LLMs for the rapid understanding of large-scale scientific computing software, eliminating the need for extensive coding expertise, and thereby making the process more efficient and effective. S3LLM is available at https://github.com/ResponsibleAILab/s3llm.
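
The retrieval step behind RAG-based document querying can be illustrated with a minimal sketch. This is not S3LLM's implementation and does not use LangChain: it assumes a simple bag-of-words cosine similarity in place of a learned embedding model, and the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) and the sample document chunks are hypothetical, invented only for illustration. The idea it shows is the general one: rank document chunks against the question, then prepend the best matches to the prompt sent to the LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Prepend the retrieved chunks to the question as context for an LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical document chunks, invented for illustration only.
chunks = [
    "The land model computes soil temperature in each soil layer.",
    "Photosynthesis is calculated separately for sunlit and shaded leaves.",
    "The build system uses CMake and supports several compilers.",
]
prompt = build_prompt("How is soil temperature computed?", chunks, k=1)
print(prompt)
```

In a real deployment the word-count similarity would normally be replaced by vector embeddings and a vector store, and the resulting prompt would be passed to a locally hosted model; the ranking-then-prompting structure stays the same.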

Paper Structure (14 sections, 1 figure, 2 tables)