To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models

Fouad Trad, Ali Chehab

TL;DR

The paper investigates the viability of ensemble majority voting for phishing URL detection using LLMs. It introduces three strategies (prompt-based, model-based, and hybrid) and evaluates them across five LLMs with zero-shot, one-shot, and two-shot prompts on a subset of the PhishStorm dataset. The results show that ensembles can yield gains when component performance is similar, but do not outperform the top model when there is a large performance gap, as is the case with GPT-4. The work provides practical guidance on when to deploy ensemble LLMs for phishing detection and outlines avenues for dynamic, weighted voting and broader LLM coverage.

Abstract

The effectiveness of Large Language Models (LLMs) depends heavily on the quality of the prompts they receive. However, even when processing identical prompts, LLMs can yield varying outcomes due to differences in their training processes. To leverage the collective intelligence of multiple LLMs and enhance their performance, this study investigates three majority voting strategies for text classification, focusing on phishing URL detection. The strategies are: (1) a prompt-based ensemble, which applies majority voting across the responses generated by a single LLM to various prompts; (2) a model-based ensemble, which aggregates responses from multiple LLMs to a single prompt; and (3) a hybrid ensemble, which combines the two methods by sending different prompts to multiple LLMs and then aggregating their responses. Our analysis shows that ensemble strategies are best suited to cases where individual components exhibit equivalent performance levels. However, when there is a significant discrepancy in individual performance, the ensemble may not exceed the highest-performing single LLM or prompt. In such instances, opting for ensemble techniques is not recommended.
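The three voting strategies described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The `Classifier` type, the tie-breaking rule, the toy models (`keyword_model`, `always`), and the prompt strings are all hypothetical stand-ins for real LLM calls. The hybrid variant here assumes every model answers every prompt, which is one possible reading of the description.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a classifier maps (prompt, url) to a label.
# In practice this would wrap an LLM API call; here it is a plain function.
Classifier = Callable[[str, str], str]


def majority_vote(labels: Sequence[str], tie_break: str = "phishing") -> str:
    """Return the most frequent label; on a tie, return tie_break (an assumption)."""
    if not labels:
        raise ValueError("majority_vote needs at least one label")
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else tie_break


def prompt_based(model: Classifier, prompts: Sequence[str], url: str) -> str:
    """One model, several prompts: vote over the model's answers."""
    return majority_vote([model(p, url) for p in prompts])


def model_based(models: Sequence[Classifier], prompt: str, url: str) -> str:
    """Several models, one prompt: vote over the models' answers."""
    return majority_vote([m(prompt, url) for m in models])


def hybrid(models: Sequence[Classifier], prompts: Sequence[str], url: str) -> str:
    """Several models, several prompts: vote over every (model, prompt) pair."""
    return majority_vote([m(p, url) for m in models for p in prompts])


# Toy stand-ins for LLMs, used only to make the sketch runnable.
def keyword_model(prompt: str, url: str) -> str:
    return "phishing" if "login" in url else "legitimate"


def always(label: str) -> Classifier:
    return lambda prompt, url: label


models = [keyword_model, always("legitimate"), always("phishing")]
prompts = ["zero-shot", "one-shot", "two-shot"]  # placeholders for real prompt text

print(prompt_based(keyword_model, prompts, "http://example.test/login"))
print(model_based(models, prompts[0], "http://example.test/home"))
print(hybrid(models, prompts, "http://example.test/login"))
```

With an odd number of voters and two labels, ties cannot occur, so the tie-breaking rule only matters for even-sized ensembles or when a model returns an unexpected label.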

Paper Structure

This paper contains 19 sections and 8 figures.

Figures (8)

  • Figure 1: Proposed ensemble methods
  • Figure 2: Zero-shot, one-shot, and two-shot prompts used to classify URLs as phishing or legitimate
  • Figure 3: Individual LLM performance for each prompt
  • Figure 4: Comparison of prompt-based ensembling with the highest performing prompt for each LLM
  • Figure 5: Comparison of model-based ensembling across the five models, with the highest performing model for each prompt type
  • ...and 3 more figures