Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs
Georg Goldenits, Philip Koenig, Sebastian Raubitzek, Andreas Ekelhart
TL;DR
This study benchmarks 15 small open-source language models (≤70B parameters) on phishing-website detection using raw HTML, evaluating both runtime cost and detection performance across two experiments. While these open-source models do not yet match top proprietary LLMs, several 70B and mid-sized SLMs perform well (F1 approaching 0.89 and accuracy near 0.9) and offer clear advantages in privacy and on-premise deployment. The work details dataset construction, prompt design, and hardware considerations, and provides a cost–benefit framework to guide the choice between proprietary APIs and local models. It also outlines practical recommendations and future directions for improving SLM-based phishing detection through fine-tuning, retrieval augmentation, and multimodal extensions. Overall, the results support the viability of small, locally hosted models as privacy-preserving, cost-effective components of phishing-detection systems, while acknowledging the remaining gap to state-of-the-art proprietary systems.
Abstract
Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper investigates the feasibility of small language models (SLMs) for detecting phishing websites using only their raw HTML code. A key advantage of these models is that they can be deployed on local infrastructure, giving organisations greater control over data and operations. We systematically evaluate 15 commonly used SLMs, ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption, demonstrating that while SLMs underperform state-of-the-art proprietary LLMs, they can still provide a viable and scalable alternative to external LLM services. By presenting a comparative analysis of costs and benefits, this work lays the foundation for future research on the adaptation, fine-tuning, and deployment of SLMs in phishing detection systems, aiming to balance security effectiveness and economic practicality.
