Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

Stephen C. Gravereaux, Sheikh Rabiul Islam

TL;DR

The paper investigates whether Low-Rank Adaptation (LoRA) can approximate full-parameter fine-tuning for LLM-based malware explanations grounded in SHAP features from EMBER. A standardized evaluation framework using BLEU, ROUGE, and semantic similarity compares five LoRA configurations against a fully fine-tuned baseline on 1,050 EMBER-derived samples. Full fine-tuning generally achieves the highest explanation quality, but a mid-range LoRA configuration (~15.5% trainable parameters) delivers competitive results with substantial reductions in model size and training time, enabling deployment in resource-constrained settings. The findings indicate when to use LoRA rather than full fine-tuning and point to scaling experiments with larger LLMs and datasets to further optimize interpretability and efficiency in malware detection systems.
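To make the "trainable parameters" figure concrete, the following is a minimal sketch of how a LoRA rank translates into a trainable fraction for a single weight matrix. It relies only on the standard LoRA formulation (a frozen d_out x d_in weight plus a low-rank update B @ A). The function name and the layer dimensions used with it are hypothetical, and the paper's ~15.5% is a whole-model figure whose exact accounting is not given here, so this illustrates the arithmetic rather than reproducing that number.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of trainable parameters when one frozen d_out x d_in weight
    matrix is augmented with a rank-`rank` LoRA update B @ A.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    frozen = d_in * d_out              # original weights, kept fixed
    trainable = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return trainable / (frozen + trainable)
```

For an assumed 768 x 768 layer, rank 8 gives a trainable share of about 2%, and the share grows monotonically with rank, which is the lever that separates low-, mid-, and high-range LoRA configurations.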

Abstract

This study examines whether Large Language Models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection, particularly when LLMs are involved, remains a significant challenge. We developed an evaluation framework using Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and semantic similarity metrics to benchmark explanation quality across five LoRA configurations and a fully fine-tuned baseline. Results indicate that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over LoRA variants. However, mid-range LoRA models deliver competitive performance, exceeding full fine-tuning on two metrics; the LoRA model with 15.5% trainable parameters reduces model size by approximately 81% and training time by over 80%. These findings demonstrate that LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural language explanations for malware classifications, this approach enhances transparency, analyst confidence, and operational scalability in malware detection systems.
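As a rough illustration of the kind of scoring such an evaluation framework performs, the sketch below compares a generated explanation with a reference explanation. It is a simplified stand-in, not the authors' implementation. ROUGE-1 F1 is computed from clipped unigram overlap. The bag-of-words cosine is only a crude proxy for a semantic similarity metric, which is more commonly computed over sentence embeddings. The function names, the whitespace tokenizer, and the example strings are assumptions.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization, far simpler than real scorers.
    return text.lower().split()


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference explanation."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (stand-in for embedding similarity)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0
```

With a hypothetical pair such as "high entropy section" (generated) and "high entropy section found" (reference), unigram precision is 1.0 and recall is 0.75, giving a ROUGE-1 F1 of 6/7. Averaging such scores over a test set for each model would yield the per-configuration numbers that a comparison like this one relies on.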

Paper Structure

This paper contains 10 sections and 13 figures.

Figures (13)

  • Figure 1: Pipeline Flowchart
  • Figure 2: Truncated PE file extracted from the JSONL test features
  • Figure 3: Located within the official EMBER dataset, lines 496–510 show the feature list.
  • Figure 4: The 9th category is added on line 526 only for feature version 2 (the current version).
  • Figure 5: Full parameter output on EMBER PE sample
  • ...and 8 more figures