Deep Learning-Based Speech and Vision Synthesis to Improve Phishing Attack Detection through a Multi-layer Adaptive Framework

Tosin Ige; Christopher Kiekintveld; Aritran Piplai

TL;DR

Phishing detection is still evaded by attackers who embed multi-modal content, such as images and deepfake videos, in web pages. The authors propose a four-layer adaptive framework that combines page text processed with NLP, text read from images with computer vision (OCR), and text produced from the audio of videos (the paper's "speech synthesis" step), using Random Forest and LSTM to fuse predictions across modalities. Key contributions include the multi-layer design, the joint processing of text, images, and audio to improve detection, and publicly available artifacts for reproducibility. The reported LSTM results are 0.98 accuracy and 0.08 loss at the optimal parameter setting, which suggests the approach can detect complex, multi-modal phishing sites and could strengthen real-world anti-phishing systems.

Abstract

Attackers continually improve their phishing techniques to bypass existing state-of-the-art detection methods. This poses a serious challenge to researchers in both industry and academia, because current approaches cannot detect complex phishing attacks. Current anti-phishing methods therefore remain vulnerable, owing to the increasingly sophisticated tactics adopted by attackers and the rate at which new tactics are developed to evade detection. In this research, we propose an adaptable framework that combines deep learning and Random Forest to read images, synthesize speech from deepfake videos, and apply natural language processing at multiple prediction layers, significantly increasing the performance of machine learning models for phishing attack detection.

Paper Structure (10 sections, 2 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Related Work
Natural Language Processing (NLP)
Long Short-Term Memory (LSTM)
Experimental Setup
Dataset
Settings
Framework Adaptability and Performance Evaluation
Conclusion
Limitation and Future Research Direction

Figures (3)

Figure 1: Step-wise speech synthesis of each audio file during execution of the "for" loop in Layer 3, producing text that is then passed to Layer 4. Text from the phishing sites is processed at Layer 1, images at Layer 2, and videos at Layer 3. All text is finally passed to Layer 4 for the final prediction using Long Short-Term Memory (LSTM), a variant of the recurrent neural network.
Figure 2: LSTM network reaching 0.98 accuracy at the optimal parameter setting.
Figure 3: LSTM network reaching 0.08 loss at the optimal parameter setting.
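
The layer flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Page` type, the function names, and the three callables (`ocr`, `transcribe`, `predict`) are hypothetical placeholders for whatever OCR engine, speech-to-text step, and trained LSTM the paper actually uses. The sketch also leaves out Random Forest, because this summary does not say at which layer it is applied.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical container for the content scraped from one web page."""
    text: str
    images: List[bytes] = field(default_factory=list)
    audio_tracks: List[bytes] = field(default_factory=list)


def collect_text(page: Page,
                 ocr: Callable[[bytes], str],
                 transcribe: Callable[[bytes], str]) -> str:
    """Gather text from Layers 1-3 into one string for Layer 4."""
    parts = [page.text]                          # Layer 1: page text
    parts += [ocr(img) for img in page.images]   # Layer 2: text read from images
    for track in page.audio_tracks:              # Layer 3: loop over audio files
        parts.append(transcribe(track))
    # Drop empty fragments and normalise surrounding whitespace.
    return " ".join(p.strip() for p in parts if p.strip())


def classify(page: Page,
             ocr: Callable[[bytes], str],
             transcribe: Callable[[bytes], str],
             predict: Callable[[str], float]) -> float:
    """Layer 4: score the combined text (assumed: higher means more phishing-like)."""
    return predict(collect_text(page, ocr, transcribe))


# Stub components so the sketch runs without any external library.
demo_page = Page(text="Verify your account ",
                 images=[b"img"],
                 audio_tracks=[b"a1", b"a2"])
stub_ocr = lambda img: "enter password"
stub_transcribe = lambda audio: "urgent call "
stub_predict = lambda text: 1.0 if "password" in text else 0.0

demo_score = classify(demo_page, stub_ocr, stub_transcribe, stub_predict)
```

Passing the extractors in as callables keeps the layer ordering visible while leaving the choice of OCR, transcription, and classifier open, which matches the adaptability the framework claims.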
