Examining the Challenges in Archiving Instagram
Rachel Zheng, Michele C. Weigle
TL;DR
This paper investigates the challenges of archiving Instagram content to support disinformation research. It focuses on mementos that redirect to the Instagram login page, a problem that began in August 2019, and on the resulting decline in replayable records. The paper analyzes mementos from the Wayback Machine and other public archives, demonstrates significant limitations in archival quantity and quality, and introduces a Python-based scraper (instagram_memento_scrape.py) that extracts structured metadata from the page source of mementos captured from 2012 to 2018. The findings show that although the login-redirect issue dramatically reduces replayability on the Wayback Machine, some replayable data remains, especially in early mementos, and that the other archives exhibit similar or worse coverage. The work provides a practical tool for harvesting extractable data (e.g., follower counts, bios, captions, image URLs) from archived Instagram pages, which allows disinformation analysis to continue despite archiving constraints, and it outlines concrete directions for extending data collection and analysis in future work.
Abstract
To prevent the spread of disinformation on Instagram, we need to study the accounts and content of disinformation actors. However, because accounts that spread disinformation are malicious in nature, Instagram often bans them, making them inaccessible on the live web. The only way to study the content of banned accounts is through public web archives such as the Internet Archive. Archiving Instagram pages, however, poses many problems. We focused on one in particular: many Wayback Machine mementos of Instagram pages redirect to the Instagram login page. In this study, we determined that mementos of Instagram account pages on the Wayback Machine began redirecting to the Instagram login page in August 2019. We also found that Instagram mementos on Archive.today, Arquivo.pt, and Perma.cc are not well archived in terms of quantity and quality. Moreover, all of our attempts to archive Katy Perry's Instagram account page on Archive.today, Arquivo.pt, and Conifer were unsuccessful. Although they are in the minority, replayable Instagram mementos exist in public archives and contain valuable data for studying disinformation on Instagram. With that in mind, we developed a Python script to scrape Instagram mementos. As of August 2023, the script can scrape Wayback Machine mementos of Instagram account pages captured between November 7, 2012 and June 8, 2018.
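As an illustration of the date range and the login-redirect problem described above, the following minimal sketch builds a Wayback Machine memento URL, checks whether a capture timestamp falls inside the reported scrapable window, and flags a replayed URL that points at the Instagram login page. It is not taken from instagram_memento_scrape.py. The function names are hypothetical, the window is assumed to include both end dates, and the "/accounts/login" substring check is an assumed heuristic rather than the detection method used in the study.

```python
from datetime import datetime

# Scrapable window reported in the abstract (as of August 2023).
# Treating both end dates as inclusive is an assumption.
WINDOW_START = datetime(2012, 11, 7, 0, 0, 0)
WINDOW_END = datetime(2018, 6, 8, 23, 59, 59)


def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a 14-digit Wayback Machine timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def memento_url(ts: str, account: str) -> str:
    """Build the Wayback Machine URL for an Instagram account page capture."""
    return f"https://web.archive.org/web/{ts}/https://www.instagram.com/{account}/"


def in_scrapable_window(ts: str) -> bool:
    """Return True if the capture time falls inside the reported window."""
    return WINDOW_START <= parse_wayback_timestamp(ts) <= WINDOW_END


def looks_like_login_redirect(final_url: str) -> bool:
    """Assumed heuristic: the replayed URL contains Instagram's login path."""
    return "/accounts/login" in final_url


if __name__ == "__main__":
    ts = "20150101000000"
    print(memento_url(ts, "katyperry"), in_scrapable_window(ts))
```

A caller would typically apply `in_scrapable_window` to a list of capture timestamps first, then fetch only the surviving mementos and discard any whose final URL satisfies `looks_like_login_redirect`.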
