Automating the Analysis of Public Saliency and Attitudes towards Biodiversity from Digital Media

Noah Giebink, Amrita Gupta, Diogo Veríssimo, Charlotte H. Chang, Tony Chang, Angela Brennan, Brett Dickson, Alex Bowmer, Jonathan Baillie

TL;DR

The paper addresses the challenge of globally assessing public attitudes toward biodiversity by introducing a scalable NLP pipeline that combines a folk taxonomy for search-term generation with zero-shot relevance filtering to curate biodiversity discourse from news and X data. It integrates TF-IDF cosine similarity for deduplication, full-text scraping, and sentiment/topic analyses, enabling time- and space-resolved insights. A case study around the COVID-19 period reveals taxon-specific shifts in volume and sentiment, and demonstrates substantial non-biodiversity content and content syndication that the method helps to filter out. The approach offers conservation practitioners an out-of-the-box framework for real-time, global monitoring aligned with the Global Biodiversity Framework, while highlighting future needs such as multilingual support and adaptation to evolving platforms.

Abstract

Measuring public attitudes toward wildlife provides crucial insights into our relationship with nature and helps monitor progress toward Global Biodiversity Framework targets. Yet, conducting such assessments at a global scale is challenging. Manually curating search terms for querying news and social media is tedious, costly, and can lead to biased results. Raw news and social media data returned from queries are often cluttered with irrelevant content and syndicated articles. We aim to overcome these challenges by leveraging modern Natural Language Processing (NLP) tools. We introduce a folk taxonomy approach for improved search term generation and employ cosine similarity on Term Frequency-Inverse Document Frequency vectors to filter syndicated articles. We also introduce an extensible relevance filtering pipeline which uses unsupervised learning to reveal common topics, followed by an open-source zero-shot Large Language Model (LLM) to assign topics to news article titles, which are then used to assign relevance. Finally, we conduct sentiment, topic, and volume analyses on resulting data. We illustrate our methodology with a case study of news and X (formerly Twitter) data before and during the COVID-19 pandemic for various mammal taxa, including bats, pangolins, elephants, and gorillas. During the data collection period, up to 62% of articles including keywords pertaining to bats were deemed irrelevant to biodiversity, underscoring the importance of relevance filtering. At the pandemic's onset, we observed increased volume and a significant sentiment shift toward horseshoe bats, which were implicated in the pandemic, but not for other focal taxa. The proposed methods open the door to conservation practitioners applying modern and emerging NLP tools, including LLMs "out of the box," to analyze public perceptions of biodiversity during current events or campaigns.
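The syndication filter described in the abstract (cosine similarity on TF-IDF vectors) can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the greedy keep-first strategy, the 0.9 threshold, and the function names (`tokenize`, `tfidf_vectors`, `drop_syndicated`) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: f * idf[t] for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def drop_syndicated(docs, threshold=0.9):
    """Greedily keep the first copy of each article.

    Returns the indices of documents whose similarity to every previously
    kept document is below `threshold` (0.9 is an illustrative value only).
    """
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


articles = [
    "Horseshoe bats roost in caves across Asia.",
    "Horseshoe bats roost in caves across Asia!",  # syndicated copy
    "Elephants migrate along ancient corridors.",
]
print(drop_syndicated(articles))
```

The pairwise comparison is quadratic in the number of articles, which is acceptable for a sketch; a corpus at the scale described in the paper would likely need blocking or an approximate nearest-neighbor index.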

Paper Structure (19 sections, 9 figures, 3 tables)


Figures (9)

  • Figure 1: A diagram of the data pipeline, starting from constructing a folk taxonomy to derive search terms; retrieving news and tweets by querying each data source; performing zero-shot relevance modeling and scraping to obtain full-text for the news media articles; filtering out syndicated news and identifying specific references to queried taxa within news articles; and finally conducting analyses on shifts in volume, sentiment, and topics in the tweets and news articles through time and over space.
  • Figure 2: Example of an initial connected component in the folk taxonomy graph for species in Order Carnivora based on their IUCN Red List common names. Solid lines represent edges between species and their listed common names; dashed lines represent edges between names and simplified names; and dotted lines represent connections that would be pruned on inspection to separate conceptually distinct taxa.
  • Figure 3: Number of news articles obtained at each stage of the GDELT data collection pipeline, run in full on ten query taxa, from querying to relevance filtering, web scraping, and deduplication.
  • Figure 4: Chord diagrams depicting the co-occurrence of relevant topics for two focal taxa, Horseshoe bat (family Rhinolophidae) and Long-tongued bat (genera Glossophaga, Craseonycteris, and Leptonycteris). A wider chord indicates that more articles contain both connected topics, and each chord is a band colored by one of the nodes it connects. The circular perimeter of each diagram displays the proportional occurrence of each topic in the dataset, and the colors correspond to different groups of topics.
  • Figure 5: Changes in volume through time. The solid vertical magenta line denotes March 11, 2020, the date on which the World Health Organization (WHO) declared COVID-19 a pandemic. The dashed vertical orange and blue lines mark significant breakpoints in the trend for GDELT and Twitter, respectively, after Bonferroni correction for the family-wise error rate.
  • ...and 4 more figures
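
The folk taxonomy graph described in the Figure 2 caption (species linked to common names, names linked to simplified names, and connected components grouping taxa) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `simplify` rule (keep the last word of a name) is hypothetical, and the species-to-name mapping is a hand-typed toy example rather than data taken from the IUCN Red List.

```python
from collections import defaultdict


def simplify(name):
    """Hypothetical simplification rule: keep the last word, lowercased."""
    return name.lower().split()[-1]


def build_graph(species_to_names):
    """Build an undirected graph with species, name, and simplified-name nodes."""
    graph = defaultdict(set)
    for species, names in species_to_names.items():
        for name in names:
            species_node = ("species", species)
            name_node = ("name", name.lower())
            simple_node = ("simple", simplify(name))
            # Species <-> common name edge (solid lines in the caption).
            graph[species_node].add(name_node)
            graph[name_node].add(species_node)
            # Common name <-> simplified name edge (dashed lines in the caption).
            graph[name_node].add(simple_node)
            graph[simple_node].add(name_node)
    return graph


def connected_components(graph):
    """Return the connected components of the graph via depth-first search."""
    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def species_groups(species_to_names):
    """List the species in each component, sorted for stable output."""
    components = connected_components(build_graph(species_to_names))
    return sorted(
        sorted(label for kind, label in comp if kind == "species")
        for comp in components
    )


# Toy input (assumed for illustration only).
toy = {
    "Panthera leo": ["Lion", "African Lion"],
    "Puma concolor": ["Mountain Lion", "Puma"],
    "Panthera tigris": ["Tiger"],
}
print(species_groups(toy))
```

In this toy input, the simplified name "lion" joins Panthera leo and Puma concolor into one component. That is the kind of link the caption says would be pruned on inspection to separate conceptually distinct taxa, so a manual review step after component extraction remains necessary.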