Can Small GenAI Language Models Rival Large Language Models in Understanding Application Behavior?

Mohammad Meymani, Hamed Jelodar, Parisa Hamedi, Roozbeh Razavi-Far, Ali A. Ghorbani

TL;DR

This paper investigates whether small GenAI language models (SLMs) can rival large language models (LLMs) in understanding application behavior, using malware detection as the benchmark. It systematically compares SLMs and LLMs under both a classification-head approach and a prompt-based approach on a balanced, SBAN-derived dataset of 10k code samples, highlighting the trade-offs between accuracy and computational efficiency. The results show that small models such as Phi-4-mini and Qwen-2.5-7B can achieve competitive precision and recall while offering faster inference and lower resource demands, though large models still tend to achieve higher overall accuracy. The findings suggest a practical, resource-aware path for deploying GenAI in real-world malware analysis, and point to future work on fine-tuning, hybrid analysis, and model compression to further close the gap with very large LLMs.

Abstract

Generative AI (GenAI) models, particularly large language models (LLMs), have transformed multiple domains, including natural language processing, software analysis, and code understanding. Their ability to analyze and generate code has enabled applications such as source code summarization, behavior analysis, and malware detection. In this study, we systematically evaluate the capabilities of both small and large GenAI language models in understanding application behavior, with a particular focus on malware detection as a representative task. While larger models generally achieve higher overall accuracy, our experiments show that small GenAI models maintain competitive precision and recall, offering substantial advantages in computational efficiency, faster inference, and deployment in resource-constrained environments. We provide a detailed comparison across metrics such as accuracy, precision, recall, and F1-score, highlighting each model's strengths, limitations, and operational feasibility. Our findings demonstrate that small GenAI models can effectively complement large ones, providing a practical balance between performance and resource efficiency in real-world application behavior analysis.

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The workflow of our proposed approaches.
  • Figure 2: Grouped bar chart comparing weighted average and macro average on Precision, Recall, and F1-score across both classes for each model.
  • Figure 3: Line charts showing the variation of performance metrics. Each chart focuses on a single metric for clearer model performance comparison.
  • Figure 4: Heatmap representing Precision, Recall, and F1-score for both classes across all models. Warmer shades indicate higher scores, providing a compact and intuitive view of model performance.
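
The figures report per-class Precision, Recall, and F1-score together with their macro and weighted averages. As a reading aid, the sketch below shows how these quantities are conventionally computed from true and predicted labels. It is not the authors' evaluation code: the function names and the toy labels (1 for malware, 0 for benign) are assumptions made for illustration only. The macro average is the unweighted mean over classes, while the weighted average weights each class by its number of true samples, so the two coincide when the classes are balanced.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_metrics(y_true, y_pred):
    """Return per-class metrics plus macro and weighted averages."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    per_class = {c: per_class_metrics(y_true, y_pred, c) for c in labels}
    # Macro: every class counts equally, regardless of its size.
    macro = tuple(
        sum(per_class[c][i] for c in labels) / len(labels) for i in range(3)
    )
    # Weighted: each class contributes in proportion to its support.
    weighted = tuple(
        sum(per_class[c][i] * support[c] for c in labels) / total
        for i in range(3)
    )
    return per_class, macro, weighted


if __name__ == "__main__":
    # Hypothetical toy labels: 1 = malware, 0 = benign (imbalanced on purpose).
    y_true = [1, 1, 1, 0]
    y_pred = [1, 1, 0, 0]
    per_class, macro, weighted = averaged_metrics(y_true, y_pred)
    print("per class:", per_class)
    print("macro (P, R, F1):", macro)
    print("weighted (P, R, F1):", weighted)
```

On an imbalanced toy input like the one above, the two averages differ; on a balanced dataset such as the one described in the TL;DR, they are expected to be identical.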