MagnetDB: A Longitudinal Torrent Discovery Dataset with IMDb-Matched Movies and TV Shows
Scott Seidenberger, Noah Pursell, Anindya Maiti
TL;DR
MagnetDB addresses the need for a long-running, content-rich view of BitTorrent supply-side dynamics. It is built by continuously crawling the BitTorrent DHT from 2018 to 2024, compiling more than 28.6M torrents and metadata for more than 950M files, and then enriching a subset of video files with IMDb metadata through an Elasticsearch-based matching pipeline that applies a $2\sigma$ threshold. The approach combines large-scale torrent discovery with file-level parsing and metadata augmentation, yielding a subset of 1.56M IMDb-matched video files spanning roughly 751K movies and 811K TV episodes. Key contributions are a longitudinal, content-centric dataset that reveals supply-side practices, The Scene’s influence, and selective coverage patterns, together with a transparent processing pipeline and adherence to FAIR data principles to support reproducibility and reuse. The dataset enables analyses of piracy ecosystems, distribution trajectories, and cross-genre platform targeting, with practical implications for cultural analytics, policy, and anti-piracy strategies. MagnetDB thus provides a foundational, open resource for studying the socio-technical dynamics of digital piracy at scale.
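The summary does not spell out how the $2\sigma$ threshold is applied. As an illustration only, the sketch below shows one plausible reading: a candidate IMDb title is accepted when its relevance score exceeds the mean of the remaining candidates' scores by more than two standard deviations. The function name `accept_top_match`, the parameter `k`, and the choice to compute statistics over the non-top candidates are assumptions made for this example, not the authors' implementation.

```python
from statistics import mean, stdev


def accept_top_match(scores, k=2.0):
    """Return True if the best score is a k-sigma outlier among the other candidates.

    `scores` is a list of relevance scores (for example, search-engine scores)
    for the candidate titles returned for one file name. Hypothetical helper:
    the acceptance rule here is an assumed reading of a "2-sigma threshold".
    """
    # With fewer than three candidates, a spread over the rest is undefined.
    if len(scores) < 3:
        return False
    ranked = sorted(scores, reverse=True)
    top, rest = ranked[0], ranked[1:]
    # Accept only when the top score clearly separates from the other candidates.
    return top > mean(rest) + k * stdev(rest)


# One candidate stands far above the others, so the match is accepted.
clear = accept_top_match([42.0, 11.0, 10.5, 9.8, 10.2])

# Scores are tightly clustered, so no candidate is distinctive enough.
ambiguous = accept_top_match([11.8, 11.5, 11.0, 10.5])
```

Under this reading, a file whose candidates all score similarly is left unmatched, which favors precision over coverage.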
Abstract
BitTorrent remains a prominent channel for the illicit distribution of copyrighted material, yet the supply side of such content is understudied. We introduce MagnetDB, a longitudinal dataset of torrents discovered through the BitTorrent DHT between 2018 and 2024, containing more than 28.6 million torrents and metadata for more than 950 million files. While our primary focus is on enabling research on the supply of pirated movies and TV shows, the dataset also encompasses other legitimate and illegitimate torrents. By applying IMDb matching and annotation to movie and TV show torrents, MagnetDB facilitates detailed analyses of how pirated content evolves in the BitTorrent network. Researchers can use MagnetDB to examine distribution trends, subcultural practices, and the gift economy within piracy ecosystems. Through its scale and temporal scope, MagnetDB presents a unique opportunity for investigating the broader dynamics of BitTorrent and advancing empirical knowledge on digital piracy.
