THIR: Topological Histopathological Image Retrieval
Zahra Tabatabaei, Jon Sporring
TL;DR
THIR addresses efficient, label-free content-based medical image retrieval (CBMIR) in digital pathology using topological data analysis. It builds a $3R$-dimensional descriptor from cubical persistence across the RGB channels, focusing on the Betti curve for $\beta_1$, and performs retrieval with Euclidean distance in a fully unsupervised, training-free pipeline. On the BreaKHis dataset, THIR delivers state-of-the-art retrieval performance across magnifications while requiring only about 20 minutes of CPU computation and no training, which highlights its interpretability and practicality in resource-constrained clinical settings. These results demonstrate the potential of topological signatures as robust, scalable features for histopathological image retrieval and suggest avenues for extending the approach to higher-dimensional, multi-channel, and whole-slide analyses.
Abstract
According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision-making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis, specifically Betti numbers derived from persistent homology, to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. Similarity retrieval is then performed by computing distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state-of-the-art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training-free solution for clinical image retrieval.
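The descriptor and retrieval steps described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the $\beta_1$ persistence intervals of each colour channel are taken as already computed by a cubical persistence library, $R$ is interpreted as the number of evenly spaced thresholds at which each Betti curve is sampled, and the function names (`betti_curve`, `descriptor`, `top_k`) and the intensity range 0 to 255 are hypothetical choices made for illustration.

```python
import math


def betti_curve(intervals, thresholds):
    """For each threshold t, count the (birth, death) intervals alive at t."""
    return [sum(1 for birth, death in intervals if birth <= t < death)
            for t in thresholds]


def descriptor(channel_intervals, resolution=4, lo=0.0, hi=255.0):
    """Concatenate per-channel beta_1 curves into one 3R-dimensional vector.

    channel_intervals: three lists of (birth, death) pairs, one per RGB channel.
    resolution: assumed meaning of R, the number of thresholds per channel.
    """
    step = (hi - lo) / (resolution - 1)
    thresholds = [lo + i * step for i in range(resolution)]
    vector = []
    for intervals in channel_intervals:
        vector.extend(betti_curve(intervals, thresholds))
    return vector


def top_k(query, database, k):
    """Return indices of the k descriptors nearest to the query (Euclidean)."""
    distances = [math.dist(query, item) for item in database]
    return sorted(range(len(database)), key=distances.__getitem__)[:k]


# Toy intervals standing in for real cubical-persistence output (hypothetical).
query_vec = descriptor(([(10, 200)], [(50, 100)], []))
database = [
    descriptor(([(10, 200)], [(50, 100)], [])),   # same loops as the query
    descriptor(([], [], [(0, 255)])),             # loops only in the blue channel
    descriptor(([(10, 100)], [(50, 100)], [])),   # one loop dies earlier
]
ranking = top_k(query_vec, database, k=2)
```

With `resolution=4` the toy descriptor has $3 \times 4 = 12$ entries. Because the pipeline needs only counting and distance computations once the intervals are available, no training step is involved, which is consistent with the CPU-only setting described in the abstract.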
