Towards Statistically Significant Taxonomy Aware Co-location Pattern Detection
Subhankar Ghosh, Arun Sharma, Jayant Gupta, Shashi Shekhar
TL;DR
This work addresses statistically significant co-location pattern detection within hierarchical taxonomies of spatial features. It proposes two algorithms, SSTCM and FDR-SSTCM, that integrate taxonomy structure with significance testing; FDR-SSTCM additionally applies the Benjamini-Hochberg procedure to control the false discovery rate. Core ideas include a participation index-based strength measure and $p$-values computed against null models derived from complete spatial randomness (CSR), which together allow patterns to be evaluated at multiple levels of the taxonomy. Evaluations on synthetic data and a real retail case study (SafeGraph MN NAICS) show reduced false discoveries and higher power, with practical implications for ecology, spatial pathology, and retail analytics.
Abstract
Given a collection of Boolean spatial feature types, their instances, a neighborhood relation (e.g., proximity), and a hierarchical taxonomy of the feature types, the goal is to find the subsets of feature types or their parents whose spatial interaction is statistically significant. This problem is important for taxonomy-reliant applications such as ecology (e.g., finding new symbiotic relationships across the food chain), spatial pathology (e.g., immunotherapy for cancer), and retail. The problem is computationally challenging because the taxonomy generates an exponential number of candidate co-location patterns. Most approaches to co-location pattern detection overlook the hierarchical relationships among spatial features, and the statistical significance of the detected patterns is not always considered, which can lead to false discoveries. This paper introduces two methods for incorporating taxonomies and assessing the statistical significance of co-location patterns. The baseline approach iteratively checks the significance of co-locations between leaf nodes or their ancestors in the taxonomy. The advanced approach applies the Benjamini-Hochberg procedure to control the false discovery rate, reducing the risk of false discoveries while maintaining the power to detect true co-location patterns. Experimental evaluation and case study results show the effectiveness of the approach.
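To make the false discovery rate control mentioned above concrete, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a list of per-pattern $p$-values. It is an illustration only, not the paper's FDR-SSTCM algorithm: the function name `benjamini_hochberg`, the `alpha` parameter, and the example $p$-values are assumptions introduced here, and the sketch does not model the taxonomy traversal or the null-model simulation.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep flag per hypothesis (hypothetical helper).

    Standard BH step-up rule: sort the m p-values, find the largest rank k
    such that p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its BH threshold.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank

    # Reject every hypothesis ranked at or below the cutoff,
    # even if its own p-value exceeded its own threshold (step-up).
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            reject[idx] = True
    return reject


# Assumed p-values for ten candidate co-location patterns (illustrative only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(example_p, alpha=0.05)
```

With these assumed inputs, only the two smallest $p$-values pass, whereas testing each pattern separately at the 0.05 level would have flagged five. That gap is the kind of false-discovery reduction a multiple-testing correction is meant to provide when many candidate patterns are evaluated at once.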
