NLD-LLM: A systematic framework for evaluating small language transformer models on natural language description

Hamed Jelodar, Mohammad Meymani, Parisa Hamedi, Tochukwu Emmanuel Nwankwo, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani

TL;DR

This article presents NLD-LLM, a systematic framework to evaluate how diverse transformer models generate natural language descriptions from code and other structured inputs. It combines standardized prompt design, an iterative refinement process, and semantic/structural metrics to compare small- and large-scale LLMs. The results show smaller models, when well-prompted, can rival larger models in semantic fidelity and usefulness for code description tasks, informing efficient deployment. The framework advances reproducible, cost-conscious evaluation of NLD capabilities across architectures.

Abstract

Natural Language Description (NLD) is a Natural Language Processing (NLP) task that requires models to generate structured and meaningful outputs from natural language inputs. In this work, we propose NLD-LLM, a systematic NLP framework for evaluating how well language models generate accurate and concise source code descriptions. This framework incorporates a diverse set of transformer models, including Qwen, DeepSeek, Phi, LLaMA, and Mistral, spanning various sizes, architectures, and training approaches. Central to NLD-LLM is a comprehensive prompt design strategy that includes standardized formatting, clear task guidance, and NLD prompting, ensuring fair and consistent evaluation. Additionally, we apply an iterative refinement process to improve output quality and assess each model's adaptability. Using semantic and structural metrics, our analysis demonstrates that prompt engineering significantly impacts model effectiveness, with smaller models often performing competitively when supported by well-crafted prompts.

Paper Structure

This paper contains 29 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: A general view of the research model for NLD generation
  • Figure 2: Two-step Pipeline Prompt
  • Figure 3: A pair plot illustrating the relationships between different LLM evaluation metrics.
  • Figure 4: Heatmap indicating the model scores
  • Figure 5: A stacked area plot for overall model performance