Detecting Phishing Sites Using ChatGPT
Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba
TL;DR
This work tackles automated phishing-site detection by placing large language models (LLMs) at the end of a crawler-driven pipeline. The authors introduce ChatPhishDetector, which gathers site data (final URL, HTML, and screenshots), simplifies the inputs, and prompts multimodal or text-only LLMs to assess phishing risk, with the output returned as JSON. In the authors' experiments, GPT-4V achieves the highest precision (98.7%) and recall (99.6%), outperforming the baseline systems and the non-vision models; this is attributed largely to multimodal analysis of branding, domain legitimacy, and social engineering cues. The study also discusses practical considerations, including cost, latency, limitations, and risks such as prompt injection, and demonstrates the potential of LLMs for scalable, multilingual phishing detection in real-world cybersecurity workflows.
Abstract
The emergence of Large Language Models (LLMs), including ChatGPT, is having a significant impact on a wide range of fields. While LLMs have been extensively researched for tasks such as code generation and text synthesis, their application to detecting malicious web content, particularly phishing sites, remains largely unexplored. To combat the rising tide of cyber attacks due to the misuse of LLMs, it is important to automate detection by leveraging the advanced capabilities of LLMs. In this paper, we propose a novel system called ChatPhishDetector that utilizes LLMs to detect phishing sites. Our system uses a web crawler to gather information from websites, generates prompts for LLMs based on the crawled data, and then retrieves the detection results from the responses generated by the LLMs. The system enables us to detect multilingual phishing sites with high accuracy by identifying impersonated brands and social engineering techniques in the context of the entire website, without the need to train machine learning models. To evaluate the performance of our system, we conducted experiments on our own dataset and compared the results across several LLMs and against baseline systems. The experimental results using GPT-4V demonstrated outstanding performance, with a precision of 98.7% and a recall of 99.6%, outperforming the detection results of other LLMs and existing systems. These findings highlight the potential of LLMs for protecting users from online fraudulent activities and have important implications for enhancing cybersecurity measures.
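The three stages named in the abstract (crawl, build a prompt, read the verdict from the response) can be illustrated with a minimal text-only sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the `CrawlResult` fields, the simplification rule in `simplify_html`, the prompt wording, and the JSON keys `phishing` and `brand` are all hypothetical. The call to the LLM itself is deliberately left out, so the sketch only covers the steps before and after it.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CrawlResult:
    """Hypothetical container for what the crawler collected."""
    final_url: str
    html: str


def simplify_html(html: str, max_chars: int = 2000) -> str:
    """Assumed simplification: drop script/style blocks, collapse
    whitespace, and truncate so the prompt stays small."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"\s+", " ", html).strip()
    return html[:max_chars]


def build_prompt(page: CrawlResult) -> str:
    """Assemble an illustrative prompt; the wording is not from the paper."""
    return (
        "Decide whether the website below is a phishing site. "
        "Consider impersonated brands, domain legitimacy, and social "
        "engineering techniques.\n"
        'Answer only with JSON: {"phishing": true|false, "brand": string|null}\n\n'
        f"URL: {page.final_url}\n"
        f"HTML: {simplify_html(page.html)}\n"
    )


def parse_verdict(response_text: str) -> dict:
    """Extract the JSON object from the model's reply, tolerating
    extra text around it."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    data = json.loads(match.group(0))
    # Only a literal JSON true counts as a phishing verdict.
    return {"phishing": data.get("phishing") is True, "brand": data.get("brand")}
```

In use, the string returned by `build_prompt` would be sent to whichever LLM is being evaluated, and the reply text passed to `parse_verdict`. Treating anything other than a literal `true` as non-phishing is a design choice of this sketch; a stricter pipeline might instead reject malformed replies and retry.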
