A Guide to Large Language Models in Modeling and Simulation: From Core Techniques to Critical Challenges

Philippe J. Giabbanelli

TL;DR

The paper addresses the challenge of effectively using large language models in Modeling & Simulation by providing a practical, pipeline-focused guide that highlights core elements (prompts, parameters, augmentation), non-determinism, and integration architectures. It advocates principled design, diagnostic strategies, and empirical evaluation, with detailed discussion of RAG and LoRA as knowledge augmentation techniques and the importance of measuring and mitigating non-determinism using DoE and robust metrics like $TAR@N$. The work emphasizes translating informal modeling requirements into formal tool inputs rather than replacing specialized tools, and it proposes architectures that balance efficiency, reproducibility, and scalability through adapters and multi-tenant serving. Overall, the paper offers concrete guidelines to help M&S practitioners decide when, how, and whether to rely on LLMs, aiming to improve reliability, interpretability, and governance in LLM-enabled workflows, including considerations for future multimodal models and unlearning approaches.

Abstract

Large language models (LLMs) have rapidly become familiar tools to researchers and practitioners. Concepts such as prompting, temperature, or few-shot examples are now widely recognized, and LLMs are increasingly used in Modeling & Simulation (M&S) workflows. However, practices that appear straightforward may introduce subtle issues or unnecessary complexity, or may even lead to inferior results. Adding more data can backfire (e.g., deteriorating performance through model collapse or inadvertently wiping out existing guardrails); spending time on fine-tuning a model can be unnecessary without a prior assessment of what it already knows; setting the temperature to 0 is not sufficient to make LLMs deterministic; and providing a large volume of M&S data as input can be excessive (LLMs cannot attend to everything), yet naive simplifications can lose information. We aim to provide comprehensive and practical guidance on how to use LLMs, with an emphasis on M&S applications. We discuss common sources of confusion, including non-determinism, knowledge augmentation (such as RAG and LoRA), decomposition of M&S data, and hyper-parameter settings. We emphasize principled design choices, diagnostic strategies, and empirical evaluation, with the goal of helping modelers make informed decisions about when, how, and whether to rely on LLMs.
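
The claim that a temperature of 0 does not guarantee determinism can be checked empirically by repeating each prompt several times and measuring how often the runs agree. The sketch below assumes that $TAR@N$ (mentioned in the TL;DR but not defined on this page) denotes a total agreement rate, i.e. the fraction of prompts whose N repeated outputs are all identical. The function name `total_agreement_rate` and the sample outputs are hypothetical and only serve as an illustration.

```python
from collections.abc import Sequence


def total_agreement_rate(runs_per_prompt: Sequence[Sequence[str]]) -> float:
    """Fraction of prompts whose N repeated outputs are all identical.

    Assumption: this is one plausible reading of TAR@N, not a definition
    taken from the paper.
    """
    if not runs_per_prompt:
        raise ValueError("need at least one prompt")
    # A prompt counts as "agreeing" only if every run produced the same string.
    agreeing = sum(1 for runs in runs_per_prompt if len(set(runs)) == 1)
    return agreeing / len(runs_per_prompt)


# Hypothetical outputs: 3 prompts, N = 4 repeated runs each at temperature 0.
outputs = [
    ["rate=0.2", "rate=0.2", "rate=0.2", "rate=0.2"],
    ["rate=0.5", "rate=0.5", "rate=0.6", "rate=0.5"],  # one run diverges
    ["SIR", "SIR", "SIR", "SIR"],
]
tar_at_4 = total_agreement_rate(outputs)
print(f"TAR@4 = {tar_at_4:.2f}")
```

Exact string comparison is the strictest choice; comparing parsed answers instead of raw text would be a reasonable variant when only the extracted value matters.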

Paper Structure

This paper contains 14 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Rather than a single prompt–response interaction, LLM-based systems are organized as pipelines: we need to know what version of the data is used (e.g., via data version control), automate the experiments (e.g., through MLflow), ensure that personally identifiable information is detected and removed (e.g., using Presidio), evaluate with respect to fairness, etc. Orchestration (e.g., based on LangChain) is necessary to coordinate these many aspects.
  • Figure 2: Inference-time generation pipeline for LLMs. Gray boxes indicate stages that are deterministic given a fixed prompt, tokenization scheme, and model weights. Blue boxes indicate stages where stochasticity may be introduced through decoding hyper-parameters and sampling. Generation proceeds autoregressively by repeatedly appending selected tokens to the context.
  • Figure 3: Illustrative response surfaces showing that the optimal temperature depends on the LLM and interacts with simulation parameters.
  • Figure 4: Retrieval-augmented generation (RAG) pipeline. Gray boxes denote stages that are deterministic given fixed embeddings, indices, and model weights, while blue boxes denote stages where stochasticity may be introduced through decoding and sampling. Retrieved contextual knowledge is incorporated by augmenting the prompt prior to tokenization and generation, not by modifying model parameters.
  • Figure 5: LLMs can mediate between informal modeling requirements and specialized tools to avoid a sub-optimal and time-consuming reimplementation of established methods. LLMs would have to translate and handle feedback from the tools.
  • ...and 2 more figures
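
The split that the Figure 2 caption draws between deterministic stages and stochastic stages can be illustrated with a toy autoregressive decoder. This is a minimal sketch and not the paper's pipeline: `toy_model`, `VOCAB`, `next_token`, and `generate` are hypothetical names, and the "model" is a fixed function from context to logits. The toy is exactly repeatable at temperature 0, which real serving stacks are not guaranteed to be, as the abstract notes.

```python
import math
import random
from collections.abc import Callable

VOCAB = ["queue", "server", "agent"]  # hypothetical three-token vocabulary


def toy_model(context: list[str]) -> dict[str, float]:
    """Deterministic stage: the same context always yields the same logits."""
    return {tok: float((len(context) * (i + 1)) % 3) for i, tok in enumerate(VOCAB)}


def next_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Stochastic stage: temperature-scaled softmax sampling (greedy at 0)."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: no randomness in this toy
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def generate(
    prompt: list[str],
    model: Callable[[list[str]], dict[str, float]],
    steps: int,
    temperature: float,
    seed: int,
) -> list[str]:
    """Autoregressive loop: append each selected token to the context."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(steps):
        context.append(next_token(model(context), temperature, rng))
    return context


greedy = generate(["model"], toy_model, steps=3, temperature=0.0, seed=1)
sampled = generate(["model"], toy_model, steps=3, temperature=1.0, seed=1)
print(greedy)
print(sampled)
```

Fixing the seed makes the sampled run repeatable in this sketch, whereas changing it may alter the sampled tokens but never the greedy ones.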