PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection

Tri Cao; Chengyu Huang; Yuexin Li; Huilin Wang; Amy He; Nay Oo; Bryan Hooi

TL;DR

PhishAgent tackles phishing webpage detection by unifying online and offline knowledge with Multimodal Large Language Models to achieve low-latency, high-accuracy detection. It introduces a Multimodal Retriever to pull top-$k$ brands from a Brand Knowledge Base using both webpage text concepts and logos, complemented by an online knowledge search. The framework demonstrates strong performance across three real-world datasets with notable robustness to adversarial HTML and image-based attacks, and an ablation study confirms the value of each component. The work advances practical phishing defenses by enabling reliable, scalable detection that handles local brands and evolving threats in real time.

Abstract

Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, with varying levels of accuracy, but each faces notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the relevant top-$k$ items from offline knowledge bases, using available information from a webpage, including logos and HTML. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

Paper Structure (32 sections, 8 equations, 2 figures, 4 tables)

Introduction
Related Works
Conventional Approaches
Reference-based Approaches
Search Engine-based Approaches
LLM/MLLM-based Approaches
Autonomous Agents
Multimodal Retrievers
Threat Model
Multimodal Agent
Overview
Preprocessing Module
Offline Knowledge-Based Module
Webpage Encoding
Brand Encoding
...and 17 more sections

Figures (2)

Figure 1: An overview of our phishing detector, PhishAgent.
Figure 2: Our multimodal retriever. Left: how a webpage $w$ and a brand $b$ are encoded and how the retrieval score $s$ is computed. Top right: how the top-$k$ brands are retrieved for the webpage $w$ during inference. Bottom right: our training process, in which contrastive learning is used to distinguish the positive brand (colored in orange) for the webpage $w$ from $N$ randomly sampled negative brands (colored in blue).
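
The retrieval and training steps described in the Figure 2 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the source does not state how the score $s$ is computed, so cosine similarity stands in for it; the loss is a generic InfoNCE-style contrastive loss; and the function names, the `temperature` parameter, the brand names, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as an assumed stand-in for the score s."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_brands(webpage_vec, brand_vecs, k):
    """Score every brand against the webpage and return the k best (name, score) pairs."""
    scored = [(name, cosine(webpage_vec, vec)) for name, vec in brand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def contrastive_loss(webpage_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive brand's score
    among the positive and the N sampled negatives (assumed form)."""
    logits = [cosine(webpage_vec, positive_vec) / temperature]
    logits += [cosine(webpage_vec, neg) / temperature for neg in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical toy embeddings; a real system would obtain these from the
# webpage and brand encoders.
brands = {
    "brand_a": [1.0, 0.0, 0.0],
    "brand_b": [0.0, 1.0, 0.0],
    "brand_c": [0.7, 0.7, 0.0],
}
webpage = [0.9, 0.1, 0.0]

retrieved = top_k_brands(webpage, brands, k=2)
retrieved_names = [name for name, _ in retrieved]

# Loss when the true brand is the closest one versus a distant one.
loss_matching = contrastive_loss(
    webpage, brands["brand_a"], [brands["brand_b"], brands["brand_c"]]
)
loss_mismatched = contrastive_loss(
    webpage, brands["brand_b"], [brands["brand_a"], brands["brand_c"]]
)
```

In this toy setup the loss is lower when the positive brand is the one the webpage embedding already sits closest to, which is the behavior contrastive training pushes the encoders toward.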
