LLM assisted web application functional requirements generation: A case study of four popular LLMs over a Mess Management System
Rashmi Gupta, Aditya K Gupta, Aarav Jain, Avinash C Pandey, Atul Gupta
TL;DR
This study benchmarks four contemporary LLMs (GPT-4o, Claude-3.7, Gemini-2.0, and DeepSeek-V3) on generating functional requirements artifacts (use cases, workflows, and business rules) for a Mess Management System. Using zero-shot prompts and a fixed reference specification, it evaluates the artifacts on syntactic and semantic correctness, consistency, non-ambiguity, and completeness (measured via Precision, Recall, and F1). The results show that all four models produce structurally sound artifacts but differ in completeness and recall: Claude is generally the most complete, DeepSeek generates the most business rules but with limited completeness, Gemini achieves high precision but lower recall, and GPT sits in between. The findings suggest that LLM-assisted requirements engineering (RE) can reduce manual effort but should be combined with human review, and that future work should explore domain-specific fine-tuning and iterative prompting to improve domain-rule generation and downstream integration.
Abstract
As in many other disciplines, Large Language Models (LLMs) have significantly impacted software engineering by helping developers generate the required artifacts across various phases of software development. This paper presents a case study comparing the performance of four popular LLMs (GPT, Claude, Gemini, and DeepSeek) in generating functional specifications, comprising use cases, business rules, and collaborative workflows, for a web application, the Mess Management System. The study evaluated the quality of the LLM-generated use cases, business rules, and collaborative workflows in terms of their syntactic and semantic correctness, consistency, non-ambiguity, and completeness, comparing the artifacts each model produced from a zero-shot prompt of the problem statement against reference specifications. Our results suggest that all four LLMs can specify syntactically and semantically correct, mostly non-ambiguous artifacts. Still, they may be inconsistent at times and may differ significantly in the completeness of the generated specification. Claude and Gemini generated all the reference use cases, with Claude producing the most complete but somewhat redundant use case specifications. Similar results were obtained for workflows. However, all four LLMs struggled to generate relevant business rules, with DeepSeek generating the most reference rules but with less completeness. Overall, Claude generated more complete specification artifacts, while Gemini was more precise in the specifications it generated.
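To make the completeness measure concrete, the following is a minimal sketch of how Precision, Recall, and F1 could be computed for a set of generated artifacts against a reference set. It assumes that matching reduces to exact equality of normalized labels, which is a simplification; the study's actual matching procedure is not described here and may well rely on human judgment. The function name `completeness_scores` and the example labels are hypothetical and are not taken from the paper's reference specification.

```python
def completeness_scores(generated, reference):
    """Return (precision, recall, f1) for generated items against a reference set.

    Assumption: two artifacts match when their labels are identical.
    """
    generated, reference = set(generated), set(reference)
    matched = generated & reference

    # Precision: share of generated items that appear in the reference.
    precision = len(matched) / len(generated) if generated else 0.0
    # Recall: share of reference items that the model generated.
    recall = len(matched) / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical use case labels, for illustration only.
reference = {"register student", "view menu", "give feedback", "generate bill"}
generated = {
    "register student",
    "view menu",
    "generate bill",
    "manage inventory",
    "send notification",
}

p, r, f = completeness_scores(generated, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this reading, a model that emits many extra artifacts lowers its precision without affecting recall, while a model that emits only a few well-matched artifacts can show high precision with lower recall, which is the kind of trade-off the abstract reports between Claude and Gemini.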
