Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection

Devesh Pant, Rishi Raj Grandhe, Vipin Samaria, Mukul Paul, Sudhir Kumar, Saransh Khanna, Jatin Agrawal, Jushaan Singh Kalra, Akhil VSSG, Satish V Khalikar, Vipin Garg, Himanshu Chauhan, Pranay Verma, Neha Khandelwal, Soma S Dhavala, Minesh Mathew

TL;DR

Health Sentinel is a multilingual, end-to-end pipeline for real-time outbreak detection from online media. It integrates data ingestion, language-aware filtering, translation, event extraction via QA/NLI and LLMs, entity mapping, clustering, and human review. The system covers 13 Indic languages and 122 diseases, achieving high recall in article relevance and strong performance gains in end-to-end event extraction through LLMs, with daily clustering to de-duplicate reports. Deployed since April 2022, it has processed over 300 million articles and identified more than 95,000 health events, of which over 3,500 were shortlisted by public health experts as potential outbreaks, demonstrating substantial impact for India's event-based surveillance. The work highlights the practical value of combining multilingual NLP, LLM-based information extraction, and human-in-the-loop verification to support timely public health responses while addressing misinformation and regional representation challenges.

Abstract

Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation, and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by the public health experts at NCDC as potential outbreaks.

Paper Structure

This paper contains 39 sections, 6 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Health Sentinel extracts structured information from online articles reporting unusual health events. The given example shows how our pipeline extracts multiple events from a single news article.
  • Figure 2: System Overview of Health Sentinel. Health Sentinel combines rule-based and ML techniques alongside a human-in-the-loop system to ensure a high level of reliability and efficiency. Its data ingestion pipeline continuously collects news articles from the web and stores them in a database. The article processing pipeline retrieves these articles, filters out irrelevant data, and extracts health events. The extracted events are then sent for expert review before publication for ground-level action.
  • Figure 3: Logic for mapping extracted locations to the appropriate State, District, Sub-district, and Urban Local Body (ULB). First, individual locations are extracted from the comma-separated values. The process starts by assigning a state if one is present, followed by a district and a sub-district/ULB. If a state is not identified, the logic tries to assign a district or sub-district/ULB and then reverse maps to determine the corresponding state. If multiple values are found during assignment, the location is not mapped.
  • Figure 4: Logic flow of the rules used to determine the threshold applied to the similarity score for a pair of events.
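
The location-mapping logic summarised in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `GAZETTEER` rows, the `map_location` function, and the exact-match lookup are all assumptions made for illustration, and a real system would presumably rely on a full administrative gazetteer and more tolerant name matching. The sketch follows the order given in the caption: assign a state if one is named, then a district and a sub-district/ULB, reverse map to the state when only a lower level is named, and leave the location unmapped when an assignment is ambiguous.

```python
from typing import Optional

# Hypothetical toy gazetteer: (state, district, sub-district or ULB) rows.
GAZETTEER = [
    ("Kerala", "Ernakulam", "Kochi"),
    ("Kerala", "Kozhikode", "Vadakara"),
    ("Maharashtra", "Pune", "Haveli"),
    ("Maharashtra", "Aurangabad", "Paithan"),
    ("Bihar", "Aurangabad", "Daudnagar"),
]


class Ambiguous(Exception):
    """Raised when more than one candidate matches at some level."""


def _pick_one(candidates: set):
    # Zero candidates -> None, one -> that candidate, several -> ambiguous.
    if len(candidates) > 1:
        raise Ambiguous
    return next(iter(candidates), None)


def map_location(raw: str) -> Optional[dict]:
    """Map a comma-separated location string to state/district/sub-district."""
    # Split the comma-separated values into individual location names.
    names = {part.strip().lower() for part in raw.split(",") if part.strip()}
    try:
        # 1. Assign a state if one is explicitly named.
        state = _pick_one({s for s, _, _ in GAZETTEER if s.lower() in names})
        rows = [r for r in GAZETTEER if state is None or r[0] == state]
        # 2. Assign a district, restricted to the state when it is known.
        district = _pick_one({(s, d) for s, d, _ in rows if d.lower() in names})
        if district:
            rows = [r for r in rows if (r[0], r[1]) == district]
        # 3. Assign a sub-district/ULB among the remaining rows.
        sub = _pick_one({r for r in rows if r[2].lower() in names})
    except Ambiguous:
        return None  # multiple values found: the location is not mapped

    # Reverse mapping: the most specific match supplies its parent levels.
    if sub:
        return {"state": sub[0], "district": sub[1], "sub_district_or_ulb": sub[2]}
    if district:
        return {"state": district[0], "district": district[1], "sub_district_or_ulb": None}
    if state:
        return {"state": state, "district": None, "sub_district_or_ulb": None}
    return None  # nothing recognised
```

In this toy data, `map_location("Pune")` recovers Maharashtra by reverse mapping, whereas `map_location("Aurangabad")` returns `None` because the district name exists in two states; adding the state, as in `map_location("Aurangabad, Bihar")`, resolves the ambiguity.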