How Can We Effectively Use LLMs for Phishing Detection?: Evaluating the Effectiveness of Large Language Model-based Phishing Detection Models

Fujiao Ji, Doowon Kim

TL;DR

This study systematically evaluates seven LLMs, including commercial and open-source models, for phishing detection and brand identification using multimodal inputs (screenshots, logos, HTML, URLs). It benchmarks against deep-learning detectors and analyzes how input modalities, temperature, and prompt design affect performance, finding that commercial LLMs excel at phishing detection while DL models excel on benign samples, and that screenshots are the most informative input for brand identification at low temperatures. The study identifies failure modes (e.g., missing phishing signals, HTML truncation, noise from multimodal inputs) and provides configuration guidelines (screenshots with zero temperature, HTML as auxiliary input) to maximize accuracy. The work contributes a thorough, open analysis of LLM-based phishing defenses and will share a refined dataset to support reproducibility and further research.

Abstract

Large language models (LLMs) have emerged as a promising phishing detection mechanism, addressing the limitations of traditional deep learning-based detectors, including poor generalization to previously unseen websites and a lack of interpretability. However, LLMs' effectiveness for phishing detection remains unexplored. This study investigates how to effectively leverage LLMs for phishing detection (including target brand identification) by examining the impact of input modalities (screenshots, logos, HTML, and URLs), temperature settings, and prompt engineering strategies. Using a dataset of 19,131 real-world phishing websites and 243 benign sites, we evaluate seven LLMs -- two commercial models (GPT-4.1 and Gemini 2.0 Flash) and five open-source models (Qwen, Llama, Janus, DeepSeek-VL2, and R1) -- alongside two deep learning (DL)-based baselines (PhishIntention and Phishpedia). Our findings reveal that commercial LLMs generally outperform open-source models in phishing detection, while DL models perform better on benign samples. For brand identification, screenshot inputs achieve the best results, with commercial LLMs reaching 93-95% accuracy and open-source models, particularly Qwen, achieving up to 92%. However, combining multiple input modalities or applying one-shot prompts does not consistently improve performance and may degrade it. Furthermore, higher temperature values reduce performance. Based on these results, we recommend using screenshot inputs with zero temperature to maximize accuracy for LLM-based detectors, with HTML serving as auxiliary context when screenshot information is insufficient.
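
The recommended configuration (screenshot input at zero temperature, with HTML added only when the screenshot alone is insufficient) could be wired up roughly as follows. This is a minimal sketch, not the authors' implementation: the request dictionary shape, the `ask` callable, the "unknown" sentinel, and the `MAX_HTML_CHARS` budget are all hypothetical names introduced for illustration, and the real request format depends on the LLM client in use.

```python
from typing import Callable, Optional

# Hypothetical request shape; adapt it to whichever LLM client you use.
Request = dict
Ask = Callable[[Request], str]

# Assumed character budget for HTML. Truncation can itself drop phishing
# signals, so this value would need tuning per model.
MAX_HTML_CHARS = 4000


def make_request(screenshot: bytes, html: Optional[str] = None) -> Request:
    """Build a zero-temperature request; HTML is attached only when given."""
    inputs = {"screenshot": screenshot}
    if html is not None:
        inputs["html"] = html[:MAX_HTML_CHARS]
    return {"temperature": 0.0, "inputs": inputs}


def identify_brand(ask: Ask, screenshot: bytes, html: str) -> str:
    """Ask with the screenshot first; retry with HTML as auxiliary context."""
    answer = ask(make_request(screenshot)).strip()
    # Assumption: the prompt tells the model to reply "unknown" when the
    # screenshot does not reveal a target brand.
    if answer.lower() != "unknown":
        return answer
    return ask(make_request(screenshot, html)).strip()
```

Passing `ask` in as a parameter keeps the sketch independent of any particular vendor SDK, and a stub function can stand in for the model when testing the fallback logic.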

Paper Structure

This paper contains 42 sections, 3 figures, and 12 tables.

Figures (3)

  • Figure 1: Overview of Our Evaluation Experiment.
  • Figure 2: Screenshot Input Fail Case for GPT.
  • Figure 3: Logo Input Fail Case for GPT.