To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models

Fouad Trad; Ali Chehab

TL;DR

The paper investigates whether ensemble majority voting is viable for phishing URL detection with LLMs. It introduces three strategies (prompt-based, model-based, and hybrid) and evaluates them across five LLMs with zero-, one-, and two-shot prompts on a subset of the PhishStorm dataset. The results show that ensembles can yield gains when their components perform similarly, but do not outperform the best single model when there is a large performance gap, as with GPT-4. The work offers practical guidance on when to deploy LLM ensembles for phishing detection and outlines future directions: dynamic, weighted voting and broader LLM coverage.

Abstract

The effectiveness of Large Language Models (LLMs) depends heavily on the quality of the prompts they receive. However, even when given identical prompts, different LLMs can produce different outputs because of differences in their training processes. To leverage the collective intelligence of multiple LLMs and improve their performance, this study investigates three majority voting strategies for text classification, focusing on phishing URL detection. The strategies are: (1) a prompt-based ensemble, which applies majority voting to the responses a single LLM generates for several different prompts; (2) a model-based ensemble, which aggregates the responses of multiple LLMs to a single prompt; and (3) a hybrid ensemble, which combines the two methods by sending different prompts to multiple LLMs and then aggregating their responses. Our analysis shows that ensemble strategies are best suited to cases where the individual components perform at comparable levels. However, when individual performance differs significantly, the ensemble may not exceed the highest-performing single LLM or prompt. In such instances, ensemble techniques are not recommended.
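The three voting strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Classifier` signature, the label strings, the tie-breaking rule, the choice to query every model with every prompt in the hybrid case, and the stub models are all assumptions made for illustration. In practice each classifier would wrap a call to an actual LLM.

```python
from collections import Counter
from typing import Callable, Sequence

Label = str
# Hypothetical interface: a classifier takes (prompt, url) and returns a label.
Classifier = Callable[[str, str], Label]


def majority_vote(votes: Sequence[Label]) -> Label:
    """Return the most frequent label among the votes.

    Assumption: ties are resolved toward "phishing" (the cautious choice);
    the source does not specify a tie-breaking rule.
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return "phishing" if "phishing" in winners else winners[0]


def prompt_based(model: Classifier, prompts: Sequence[str], url: str) -> Label:
    """One model, several prompts: vote over that model's responses."""
    return majority_vote([model(p, url) for p in prompts])


def model_based(models: Sequence[Classifier], prompt: str, url: str) -> Label:
    """Several models, one prompt: vote over the models' responses."""
    return majority_vote([m(prompt, url) for m in models])


def hybrid(models: Sequence[Classifier], prompts: Sequence[str], url: str) -> Label:
    """Several models and several prompts.

    Assumption: every model answers every prompt, and all answers are pooled.
    """
    return majority_vote([m(p, url) for m in models for p in prompts])


# Stub classifiers standing in for real LLM calls (purely illustrative).
def always_phishing(prompt: str, url: str) -> Label:
    return "phishing"


def always_benign(prompt: str, url: str) -> Label:
    return "benign"


def keyword_model(prompt: str, url: str) -> Label:
    return "phishing" if "login" in url else "benign"


if __name__ == "__main__":
    models = [always_phishing, always_benign, keyword_model]
    prompts = ["zero-shot", "one-shot", "two-shot"]  # placeholder prompt texts
    url = "http://example.com/login"
    print(prompt_based(keyword_model, prompts, url))
    print(model_based(models, prompts[0], url))
    print(hybrid(models, prompts, url))
```

With a binary label set, using an odd number of votes (for example three prompts, or three models) avoids ties altogether, so the tie-breaking assumption only matters for even-sized ensembles. The sketch also makes the abstract's caveat concrete: because every vote counts equally, two weak components can outvote one strong component, which is why an unweighted ensemble may fall short of the best single model when performance differs widely.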
