Experiments with Large Language Models on Retrieval-Augmented Generation for Closed-Source Simulation Software
Andreas Baumann, Peter Eberhard
TL;DR
This study tackles hallucinations in LLMs when working with closed-source simulation software by evaluating Retrieval-Augmented Generation (RAG) on Pasimodo using both a commercial RAG system (NotebookLM) and an open-source variant (AnythingLLM) with local LLMs. It analyzes system configurations, data preprocessing, and six task-focused prompts to assess how prompting strategies and supplementary documents affect performance. Key findings show NotebookLM often outperforming local models, but all systems suffer from incomplete information and potentially outdated content; tailored prompts, additional general literature, and error-driven refinements significantly improve results. The work highlights practical paths and limitations for deploying RAG in closed-source environments and points to future work on retrieval quality, larger context windows, and open-weight LLMs for secure, local use.
Abstract
Large Language Models (LLMs) have become indispensable tools in development and programming. However, they suffer from hallucinations, especially when asked about knowledge that is absent from their training data. This is particularly the case when LLMs are used to support closed-source software applications. Retrieval-Augmented Generation (RAG) offers an approach that uses additional knowledge alongside the pre-trained knowledge of the LLM to respond to user prompts. Possible tasks range from smart autocomplete, text extraction for question answering, model summarization, component explanation, and compositional reasoning to the creation of simulation components and complete input models. This work tests existing RAG systems for closed-source simulation frameworks, in our case the mesh-free simulation software Pasimodo. Since data protection and intellectual property rights are particularly important for problems solved with closed-source software, the tests focus on execution using local LLMs. To enable smaller institutions to use the systems, smaller language models are tested first. The systems show impressive results, but often fail due to insufficient information. Different approaches for improving response quality are tested. In particular, tailoring the information provided to the LLMs to the respective prompt proves to be a significant improvement. This demonstrates both the great potential of the approach and the further work needed to improve information retrieval for closed-source simulation models.
