Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs

Georg Goldenits, Philip Koenig, Sebastian Raubitzek, Andreas Ekelhart

TL;DR

This study benchmarks 15 small open-source language models (SLMs, ≤70B parameters) on phishing-website detection using raw HTML, evaluating both runtime cost and detection performance across two experiments. While the open-source models do not yet match top proprietary models, several 70B and mid-sized SLMs perform strongly (F1 scores approaching 0.89 and accuracy near 0.9) and offer clear advantages in privacy and on-premise deployment. The work details dataset construction, prompt design, and hardware considerations, and provides a cost-benefit framework to guide the choice between proprietary APIs and local models. It also outlines practical recommendations and future directions for improving SLM-based phishing detection through fine-tuning, retrieval augmentation, and multimodal extensions. Overall, the results support the viability of small, locally hosted models as privacy-preserving, cost-effective components of phishing-detection systems, while acknowledging the remaining gap to state-of-the-art proprietary systems.

Abstract

Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper investigates the feasibility of small language models (SLMs) for detecting phishing websites using only their raw HTML code. A key advantage of these models is that they can be deployed on local infrastructure, providing organisations with greater control over data and operations. We systematically evaluate 15 commonly used SLMs, ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption, demonstrating that while SLMs underperform compared to state-of-the-art proprietary LLMs, they can still provide a viable and scalable alternative to external LLM services. By presenting a comparative analysis of costs and benefits, this work lays the foundation for future research on the adaptation, fine-tuning, and deployment of SLMs in phishing detection systems, aiming to balance security effectiveness and economic practicality.

Paper Structure

This paper contains 26 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Methodological approach of the benchmarking pipeline.
  • Figure 2: Runtime for each model with the D5 and D50 datasets.
  • Figure 3: Correlation between model analysis runtime and prompt length. Each dot represents the token count of one website's HTML code and the corresponding analysis time. Because websites are analysed multiple times, the 200 dots overlap significantly. As a consequence of the sampling scheme, far more websites fall towards the lower end of the token count, as they are more prevalent in the dataset.
  • Figure 4: Distribution of the logarithm of the relative analysis time differences between the D5 and D50 datasets for each model. The distributions are based on 200 analysis runs per model, meaning that each website’s time difference is included five times (once per run). Values below 0 indicate that analysing a website with fewer HTML tokens (D5) was faster than analysing the same website with more tokens (D50).
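The quantity plotted in Figure 4 can be illustrated with a minimal sketch. The caption does not give the exact formula, so this assumes the "logarithm of the relative analysis time difference" is the natural log of the ratio of D5 time to D50 time, which matches the stated reading that values below 0 mean the D5 analysis was faster. The function name `log_relative_time` and all timings are hypothetical and are not taken from the paper.

```python
import math


def log_relative_time(t_d5: float, t_d50: float) -> float:
    """Natural log of t_D5 / t_D50 (an assumed definition, not the paper's)."""
    if t_d5 <= 0 or t_d50 <= 0:
        raise ValueError("analysis times must be positive")
    return math.log(t_d5 / t_d50)


# Hypothetical analysis times in seconds for one website over five runs.
d5_times = [1.0, 1.2, 0.9, 1.1, 1.0]
d50_times = [2.0, 2.4, 0.9, 2.2, 0.5]

# One log ratio per run; negative means the D5 analysis was faster.
log_ratios = [log_relative_time(a, b) for a, b in zip(d5_times, d50_times)]

# Count the runs in which the shorter prompt (D5) was analysed faster.
faster_on_d5 = sum(r < 0 for r in log_ratios)
print(log_ratios, faster_on_d5)
```

Under this assumed definition a value of 0 means both analyses took the same time, and equal speed-ups and slow-downs sit symmetrically around 0, which is one plausible reason to plot the logarithm rather than the raw ratio.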