Deep Learning-Based Speech and Vision Synthesis to Improve Phishing Attack Detection through a Multi-layer Adaptive Framework
Tosin Ige, Christopher Kiekintveld, Aritran Piplai
TL;DR
Phishing detection remains difficult because attackers embed multi-modal content, such as images and deepfake videos, in web pages. The authors propose a four-layer adaptive framework that combines computer vision (OCR) to read images, speech synthesis from videos, and NLP-derived text, using Random Forest and LSTM models to fuse predictions across modalities. Key contributions include the multi-layer design, the joint processing of text, images, and audio to improve detection, and publicly available artifacts for reproducibility. The approach demonstrates robust detection of complex, multi-modal phishing sites and offers potential practical enhancements for real-world anti-phishing systems.
Abstract
Attackers continually refine their phishing techniques to bypass state-of-the-art detection methods, which poses significant challenges to researchers in both industry and academia because current approaches cannot detect complex phishing attacks. Current anti-phishing methods therefore remain vulnerable to complex phishing, owing to the increasingly sophisticated tactics adopted by attackers and the rate at which new tactics are developed to evade detection. In this research, we propose an adaptable framework that combines deep learning and Random Forest to read images, synthesize speech from deepfake videos, and apply natural language processing at various prediction layers, in order to significantly increase the performance of machine learning models for phishing attack detection.
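
To make the idea of combining predictions from several modalities concrete, the following is a minimal sketch of late fusion under stated assumptions. It is not the framework's actual fusion step, which relies on Random Forest and LSTM models; a simple weighted average stands in for those models here. The function names (`fuse_scores`, `is_phishing`), the modality keys, the weights, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical per-modality weights (assumption, not taken from the paper).
DEFAULT_WEIGHTS: Dict[str, float] = {"text": 0.5, "image": 0.3, "audio": 0.2}


def fuse_scores(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of the available per-modality phishing scores.

    Each score is assumed to be a probability in [0, 1] produced by a
    separate per-modality model. A score of None means the page had no
    content of that modality (for example, no embedded video), so that
    modality is skipped and the remaining weights are renormalized.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for modality, score in scores.items():
        if score is None:
            continue
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {modality!r} must lie in [0, 1]")
        weight = weights.get(modality, 0.0)
        weighted_total += weight * score
        weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("no usable modality scores")
    return weighted_total / weight_sum


def is_phishing(scores: Dict[str, Optional[float]], threshold: float = 0.5) -> bool:
    """Flag a page when the fused score reaches the (hypothetical) threshold."""
    return fuse_scores(scores) >= threshold
```

In this sketch, a page whose visible text looks benign can still be flagged once scores derived from its images and audio are taken into account, which mirrors the motivation for processing all three modalities rather than text alone.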
