MaskSearch: Querying Image Masks at Scale
Dong He, Jieyu Zhang, Maureen Daum, Alexander Ratner, Magdalena Balazinska
TL;DR
MaskSearch addresses the problem of efficiently querying large databases of image masks with predicates over regions of interest and pixel-value ranges, while guaranteeing exact results. It introduces the Cumulative Histogram Index (CHI), which bounds the pixel counts (CP) that query predicates test, and a filter-verification execution framework that prunes most masks before they are loaded from disk. The system supports aggregations, top-k queries, and incremental indexing, so that online, multi-query exploration is possible with low upfront cost. Evaluation on WILDS and ImageNet shows queries up to two orders of magnitude faster and consistent gains across diverse workloads, which makes the approach practical for mask-based ML explainability and debugging.
Abstract
Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps, depth maps) and enable a variety of applications (e.g., determine if a model is learning spurious correlations or if an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, existing systems do not support them efficiently. In this paper, we formalize the problem and propose MaskSearch, a system that focuses on accelerating queries over databases of image masks while guaranteeing the correctness of query results. MaskSearch leverages a novel indexing technique and an efficient filter-verification query execution framework. Experiments with our prototype show that MaskSearch, using indexes approximately 5% of the compressed data size, accelerates individual queries by up to two orders of magnitude and consistently outperforms existing methods on various multi-query workloads that simulate dataset exploration and analysis processes.
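As a rough illustration of the filter-verification idea, the Python sketch below builds a small cumulative count index per mask, derives lower and upper bounds on the number of pixels inside a region of interest whose values fall in a given range, and loads a mask only when the bounds cannot decide the predicate. It is a simplified stand-in and not the paper's CHI implementation: the grid size CELL, the bin count BINS, the half-open ROI and value-range conventions, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

Mask = List[List[float]]          # pixel values assumed to lie in [0, 1)
ROI = Tuple[int, int, int, int]   # (r0, c0, r1, c1), half-open pixel box

CELL = 2   # assumed spatial granularity of the index, in pixels
BINS = 4   # assumed number of equal-width value bins over [0, 1)


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def build_index(mask: Mask):
    """idx[i][j][b] = #pixels with row < i*CELL, col < j*CELL, value < b/BINS."""
    h, w = len(mask), len(mask[0])
    gh, gw = ceil_div(h, CELL), ceil_div(w, CELL)
    idx = [[[0] * (BINS + 1) for _ in range(gw + 1)] for _ in range(gh + 1)]
    for i in range(1, gh + 1):
        for j in range(1, gw + 1):
            for b in range(1, BINS + 1):
                idx[i][j][b] = sum(
                    1
                    for r in range(min(i * CELL, h))
                    for c in range(min(j * CELL, w))
                    if mask[r][c] < b / BINS
                )
    return idx


def cp(mask: Mask, roi: ROI, lv: float, uv: float) -> int:
    """Exact count of pixels in roi with lv <= value < uv (needs the full mask)."""
    r0, c0, r1, c1 = roi
    return sum(
        1 for r in range(r0, r1) for c in range(c0, c1) if lv <= mask[r][c] < uv
    )


def _aligned_count(idx, i0, j0, i1, j1, b0, b1) -> int:
    """Exact count for a cell-aligned box and bin-aligned value range."""
    if i0 >= i1 or j0 >= j1 or b0 >= b1:
        return 0

    def below(b):  # 2D inclusion-exclusion over the cumulative counts
        return idx[i1][j1][b] - idx[i0][j1][b] - idx[i1][j0][b] + idx[i0][j0][b]

    return below(b1) - below(b0)


def bounds(idx, roi: ROI, lv: float, uv: float) -> Tuple[int, int]:
    """Lower/upper bounds on cp(...) from the index alone; assumes 0 <= lv <= uv <= 1."""
    r0, c0, r1, c1 = roi
    b_lo_floor, b_hi_ceil = int(lv * BINS), ceil_div_float(uv * BINS)
    b_lo_ceil, b_hi_floor = ceil_div_float(lv * BINS), int(uv * BINS)
    # Largest aligned box/range contained in the query gives a lower bound.
    lower = _aligned_count(idx, ceil_div(r0, CELL), ceil_div(c0, CELL),
                           r1 // CELL, c1 // CELL, b_lo_ceil, b_hi_floor)
    # Smallest aligned box/range covering the query gives an upper bound.
    upper = _aligned_count(idx, r0 // CELL, c0 // CELL,
                           ceil_div(r1, CELL), ceil_div(c1, CELL),
                           b_lo_floor, b_hi_ceil)
    return lower, upper


def ceil_div_float(x: float) -> int:
    n = int(x)
    return n if n == x else n + 1


def filter_verify(masks, indexes, roi: ROI, lv: float, uv: float, threshold: int):
    """Ids of masks with cp > threshold, plus how many masks had to be loaded."""
    hits, loaded = [], 0
    for mask_id, idx in enumerate(indexes):
        lower, upper = bounds(idx, roi, lv, uv)
        if lower > threshold:
            hits.append(mask_id)          # certainly qualifies, no load needed
        elif upper > threshold:
            loaded += 1                   # undecided: verify on the real mask
            if cp(masks[mask_id], roi, lv, uv) > threshold:
                hits.append(mask_id)
        # otherwise upper <= threshold: certainly fails, pruned without loading
    return hits, loaded
```

In this sketch `masks[mask_id]` stands in for a disk read, so `loaded` approximates the I/O that the filter stage fails to avoid. Because the bounds are sound, the result is identical to evaluating `cp` on every mask; only the amount of work changes.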
