Large Language Models in Software Documentation and Modeling: A Literature Review and Findings

Lukas Radosky, Ivan Polasek

TL;DR

The paper surveys 57 recent works (2024–2025) on applying large language models to software documentation and modeling, organizing them by tasks such as commit message generation, issue tracking, Stack Overflow title/tag generation, sentiment analysis, code understanding, code summarization, and technical-document analysis. It highlights a predominant reliance on zero-shot prompting and promptless approaches, common datasets (e.g., MCMD, CodeSearchNet, CodeXGLUE, PCSD), and standard metrics (BLEU, ROUGE, METEOR, precision/recall/F1, MAE/RMSE), with GPT-4 frequently used as an evaluator. The findings indicate that LLMs primarily enhance existing SE workflows rather than redefine them: they offer speed and quality improvements, while revolutionary shifts are left to future work on tuning, ensemble methods, and broader multi-agent or DSL-focused research. The study outlines practical implications for researchers and practitioners and points to future directions, including deeper exploration of MIL/MASE contexts, more diverse evaluation, and greater emphasis on modeling and documentation tasks in real-world settings.

Abstract

Generative artificial intelligence has attracted significant attention, especially since the introduction of large language models. Its capabilities are being applied to a variety of software engineering tasks. Because they can understand natural language and generate natural-language responses, large language models are well suited to processing software documentation artifacts. They also excel at understanding structured languages, which gives them potential for working with software programs and models. We conduct a literature review on the use of large language models for software engineering tasks related to documentation and modeling. We analyze articles from four major venues in the area, organize them by the tasks they address, and provide an overview of the prompting techniques, metrics, approaches to human-based evaluation, and major datasets used.
