A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears

Frauke Wilm, Luis Carlos Rivera Monroy, Mathias Öttl, Lukas Mürdter, Leonid Mill, Andreas Maier

TL;DR

This work presents an enhanced Plasmodium falciparum dataset, created by converting the point-based annotations of the NIH malaria dataset into dense COCO-format bounding boxes via Cellpose-based segmentation and targeted manual refinement. A Faster R-CNN model trained on the revised annotations detects infected and non-infected red blood cells as well as white blood cells, achieving F1 scores of up to 0.88 for infected cells in cross-validation on the original dataset. The study shows that annotation volume and consistency matter for detection performance, and that automated refinement combined with manual curation can produce training data of sufficient quality for automated malaria diagnosis. The updated dataset, along with code and annotations, is publicly available to accelerate the development of object-detection-based malaria diagnostics, particularly in low-resource settings.
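To make the conversion step more concrete, the following is a minimal sketch of how point annotations could be attached to segmented cell instances. It assumes a Cellpose-style integer label mask (0 for background, one positive id per cell) and a hypothetical list of `(x, y, class_name)` points; the function name, the class strings, and the rule of marking unmatched or conflicting instances as ambiguous are illustrative assumptions, not the authors' documented procedure.

```python
import numpy as np


def assign_point_labels(label_mask, points, default="ambiguous"):
    """Map each instance id in `label_mask` to the class of the point inside it.

    label_mask: 2D integer array, 0 = background, k > 0 = instance k.
    points: iterable of (x, y, class_name) in pixel coordinates.
    Instances hit by no point, or by points of different classes, get `default`
    (an assumed rule for this sketch).
    """
    height, width = label_mask.shape
    hits = {}
    for x, y, cls in points:
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < height and 0 <= col < width):
            continue  # ignore points that fall outside the image
        instance = int(label_mask[row, col])
        if instance != 0:
            hits.setdefault(instance, set()).add(cls)

    labels = {}
    for instance in np.unique(label_mask):
        instance = int(instance)
        if instance == 0:
            continue
        classes = hits.get(instance, set())
        labels[instance] = next(iter(classes)) if len(classes) == 1 else default
    return labels


# Toy example: three segmented cells, two of them carrying a point annotation.
mask = np.zeros((5, 5), dtype=np.int32)
mask[0:2, 0:2] = 1
mask[3:5, 3:5] = 2
mask[0:2, 3:5] = 3
points = [(0.6, 1.2, "infected"), (4, 3, "non_infected")]
labels = assign_point_labels(mask, points)
```

Instances left as ambiguous in such a scheme are natural candidates for the targeted manual refinement mentioned above.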

Abstract

Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via Zenodo: https://doi.org/10.5281/zenodo.17514694
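As an illustration of what bounding box annotations in COCO format look like, the sketch below derives boxes from an integer label mask and wraps them in a COCO-style dictionary. The category ids and names, the file name, and the helper `mask_to_coco_boxes` are assumptions made for this example; the released annotation files may use different ids and naming.

```python
import json

import numpy as np

# Classes follow the text; ids and spellings are assumptions for this sketch.
CATEGORIES = [
    {"id": 1, "name": "non_infected_rbc"},
    {"id": 2, "name": "infected_rbc"},
    {"id": 3, "name": "white_blood_cell"},
    {"id": 4, "name": "ambiguous"},
]


def mask_to_coco_boxes(label_mask, image_id, category_of, start_id=1):
    """Turn an integer label mask (0 = background) into COCO annotation dicts.

    category_of: dict mapping instance id -> category id.
    """
    annotations = []
    next_id = start_id
    for instance in np.unique(label_mask):
        if instance == 0:
            continue
        rows, cols = np.nonzero(label_mask == instance)
        x_min, y_min = int(cols.min()), int(rows.min())
        width = int(cols.max()) - x_min + 1
        height = int(rows.max()) - y_min + 1
        annotations.append({
            "id": next_id,
            "image_id": image_id,
            "category_id": category_of[int(instance)],
            "bbox": [x_min, y_min, width, height],  # COCO order: x, y, width, height
            "area": width * height,  # box area, since no mask is stored here
            "iscrowd": 0,
        })
        next_id += 1
    return annotations


# Toy example with two cells in a 6 x 8 image.
mask = np.zeros((6, 8), dtype=np.int32)
mask[1:3, 1:4] = 1  # rows 1-2, columns 1-3
mask[3:6, 5:7] = 2  # rows 3-5, columns 5-6
anns = mask_to_coco_boxes(mask, image_id=1, category_of={1: 1, 2: 2})
coco = {
    "images": [{"id": 1, "file_name": "example.png", "width": 8, "height": 6}],
    "annotations": anns,
    "categories": CATEGORIES,
}
coco_json = json.dumps(coco, indent=2)
```

Note that COCO stores boxes as top-left corner plus width and height, whereas many detection libraries expect corner coordinates, so a conversion step is usually needed before training.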

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Different annotation types provided by the NIH dataset. (a): contour annotations, (b): point-only annotations, (c): bounding box annotations created with Cellpose (Pachitariu and Stringer, 2022). Blue: non-infected red blood cells, pink: infected cells, green: white blood cells, orange: ambiguous cells.
  • Figure 2: During label cleaning, non-annotated cells at the border of the field of view were labeled as ambiguous (orange). Blue: non-infected red blood cells, pink: infected cells, green: white blood cells.
  • Figure 3: Confusion matrices for Faster R-CNN predictions on the NIH subsets. Each matrix shows row-normalized percentages along with absolute cell counts. The last row indicates false positives (FP), i.e., cell instances detected by the model but not annotated in the dataset. The last column indicates false negatives (FN), i.e., annotated cell instances that were not detected by the model.
  • Figure 4: Representative samples from NIH subsets with white arrows indicating non-annotated cells at the border of the field of view: (a) sample from the polygon subset with detailed contour annotations, (b) sample from the point subset with spot annotations in the cell center.
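
The confusion-matrix layout described for Figure 3 (rows for annotated classes, columns for predicted classes, an extra last row for false positives and an extra last column for false negatives) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `matches` input format is hypothetical, the step that pairs detections with annotations (for example by IoU) is left out, and counting class confusions as both a false negative and a false positive in the F1 score is an assumption.

```python
import numpy as np

CLASSES = ["non-infected RBC", "infected RBC", "WBC"]


def detection_confusion(matches, n_classes):
    """Build an (n+1) x (n+1) count matrix from matched detections.

    matches: list of (gt_class, pred_class) index pairs; None marks a missing
    partner, so (gt, None) is a false negative and (None, pred) a false positive.
    """
    cm = np.zeros((n_classes + 1, n_classes + 1), dtype=int)
    for gt, pred in matches:
        row = n_classes if gt is None else gt
        col = n_classes if pred is None else pred
        cm[row, col] += 1
    return cm


def row_normalize(cm):
    """Convert counts to row-wise percentages; empty rows stay at zero."""
    totals = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm * 100.0, totals, out=out, where=totals > 0)


def f1_for_class(cm, k):
    """F1 = 2TP / (2TP + FP + FN) for class k, read off the count matrix."""
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp  # annotated as k, but missed or given another class
    fp = cm[:, k].sum() - tp  # predicted as k, but unannotated or annotated otherwise
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Toy example: 10 annotated infected cells, 2 spurious infected detections.
matches = [(1, 1)] * 8 + [(1, 0)] + [(1, None)] + [(None, 1)] * 2 + [(0, 0)] * 5
cm = detection_confusion(matches, len(CLASSES))
percent = row_normalize(cm)
f1_infected = f1_for_class(cm, 1)
```

Keeping the absolute counts next to the percentages, as the figure does, matters because rare classes such as white blood cells can show large percentage swings from only a handful of cells.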