CRATOR: a Dark Web Crawler

Daniel De Pascale, Giuseppe Cascavilla, Damian A. Tamburri, Willem-Jan Van Den Heuvel

TL;DR

CRATOR addresses the challenge of systematically crawling the dark web to extract data that sits behind anonymity and security layers. It introduces a general-purpose, Tor-enabled crawler that combines seed lists, breadth-first search (BFS), cookie rotation, login automation, and real-time human intervention for CAPTCHAs to get past these protections while maintaining anonymity. The paper presents an architecture with manual and automated cookie handling, a cookie-validation layer, and a connection module that relies on proxies and user-agent rotation. A comparative evaluation against ACHE on Cocorico Market shows that CRATOR achieves higher coverage, faster data collection, and greater robustness, suggesting practical utility for threat intelligence, cybersecurity, and online investigations.
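To make the seed-list and BFS idea concrete, the following is a minimal sketch of a breadth-first frontier, not CRATOR's actual implementation. The names `bfs_crawl`, `LinkExtractor`, and the injected `fetch` callable are hypothetical. The sketch assumes that `fetch` returns the page HTML or `None` on failure, and that the crawl stays on the hosts named in the seed list.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def bfs_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> List[str]:
    """Visit pages breadth-first from the seeds; return URLs in visit order.

    `fetch` is a hypothetical hook: in a real crawler it would issue the
    request through Tor and handle cookies, logins, and CAPTCHAs.
    """
    seed_list = list(seeds)
    frontier = deque(seed_list)          # FIFO queue gives BFS order
    seen = set(seed_list)                # URLs already queued or visited
    allowed_hosts = {urlparse(s).netloc for s in seed_list}
    visited: List[str] = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:                 # failed fetch: skip this page
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplication.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Passing the fetcher in as a parameter keeps the traversal logic separate from the connection logic, so the same frontier can be exercised offline against a dictionary of canned pages before it is pointed at a live hidden service.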

Abstract

Dark web crawling is a complex process that involves specific methodologies and techniques to navigate the Tor network and extract data from hidden services. This study proposes a general-purpose dark web crawler designed to efficiently extract pages protected by security mechanisms such as CAPTCHAs. Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content. We also incorporate user-agent rotation and proxy usage to maintain anonymity and avoid detection. We evaluate the effectiveness of our crawler using three metrics: coverage, performance, and robustness. Our results demonstrate that the crawler effectively extracts such protected pages while maintaining anonymity and avoiding detection. The proposed crawler can be used for various applications, including threat intelligence, cybersecurity, and online investigations.
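As an illustration of user-agent rotation combined with proxy usage, the sketch below builds per-request settings that cycle through a list of user agents and route traffic through a local Tor SOCKS proxy. It is an assumption-laden example rather than the paper's connection module: `request_profiles` and `USER_AGENTS` are hypothetical names, the user-agent strings are placeholders, and the proxy address assumes a Tor daemon listening on its default SOCKS port 9050 (Tor Browser uses 9150 instead).

```python
from itertools import cycle
from typing import Dict, Iterator, Sequence

# Assumed local Tor daemon. The socks5h scheme asks the proxy to resolve
# hostnames, which .onion addresses require.
TOR_SOCKS_PROXY = "socks5h://127.0.0.1:9050"

# Placeholder user agents for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def request_profiles(
    user_agents: Sequence[str],
    proxy: str = TOR_SOCKS_PROXY,
) -> Iterator[Dict[str, Dict[str, str]]]:
    """Yield an endless stream of request settings, one user agent per request."""
    if not user_agents:
        raise ValueError("at least one user agent is required")
    return (
        {
            "headers": {"User-Agent": ua},
            "proxies": {"http": proxy, "https": proxy},
        }
        for ua in cycle(user_agents)     # round-robin rotation
    )
```

Each yielded dictionary has the shape accepted by the `headers` and `proxies` keyword arguments of the `requests` library (which needs its optional SOCKS support installed), so a caller could write `requests.get(url, timeout=60, **next(profiles))`. Round-robin rotation is the simplest policy; a random choice per request would be an equally plausible reading of the abstract.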

Paper Structure (22 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Architecture.
  • Figure 2: Caption for both figures
  • Figure 3: Performance metrics.
  • Figure 4: Error rate.