Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups
Mohsen Dehghankar, Abolfazl Asudeh
TL;DR
This work addresses the problem of identifying unknown, under-represented, and under-performing minority groups when grouping attributes are not available. It proposes Minoria mining, a projection-based framework that seeks high-skew tails in feature space where model loss is elevated, using a dual-space geometric interpretation and median-region analysis. A 2D ray-sweeping algorithm leverages the $n/2$-level of the dual arrangement to efficiently locate the top skew directions, and higher-dimensional extensions are offered via discretization and focused exploration heuristics. Empirical results on Chicago Crimes, College Admissions, and several higher-dimensional datasets demonstrate that the proposed method can uncover actionable minority regions that clustering baselines often miss, providing a practical tool for responsible data auditing and bias detection. Limitations include the linear-separability assumption and practical challenges in very high dimensions, which motivate future work on nonlinear projections and alternative Minoria mining strategies.
Abstract
For a variety of reasons, such as privacy, data in the wild often lacks the grouping information required for identifying minorities. At the same time, machine learning models are known to be only as good as the data they are trained on and, hence, may underperform for under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists, who find themselves in an unknown-unknown situation: not only do they lack access to the grouping attributes, but they also do not know which groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically, we propose a geometric transformation of the data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing this solution to higher dimensions suffers from the curse of dimensionality. For such cases, we therefore propose a solution based on smart exploration of the search space. We complement the theoretical analysis with comprehensive experiments on real-world and synthetic datasets. Our experimental results demonstrate the effectiveness of the proposed solutions in mining unknown, under-represented, and under-performing minorities.
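
To make the idea of a direction in attribute space that reveals an under-represented, under-performing group more concrete, the following is a minimal sketch under stated assumptions, not the dual-space algorithm itself. It replaces the arrangement-based search with a brute-force scan over discretized angles in 2D, assumes Pearson's median skewness as the skew measure, and uses synthetic data. All names (`median_skew`, `scan_directions`, `tail_frac`) are hypothetical.

```python
import math
import random
import statistics


def median_skew(values):
    """Pearson's median skewness: 3 * (mean - median) / std (assumed measure)."""
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return 3.0 * (statistics.fmean(values) - statistics.median(values)) / sd


def scan_directions(points, losses, n_angles=180, tail_frac=0.05):
    """Brute-force scan over unit directions in 2D (not the dual-space method).

    For each direction, project the points, measure the skew of the
    projections, and compute the mean loss of the high-projection tail.
    Returns (skew, angle, tail_loss, overall_loss) tuples, highest skew first.
    """
    overall = statistics.fmean(losses)
    k = max(1, int(tail_frac * len(points)))
    results = []
    for i in range(n_angles):
        theta = 2.0 * math.pi * i / n_angles
        ux, uy = math.cos(theta), math.sin(theta)
        proj = [x * ux + y * uy for x, y in points]
        skew = median_skew(proj)
        # Positive skew means a long tail on the high-projection side,
        # so the candidate group is the k points with the largest projections.
        order = sorted(range(len(points)), key=proj.__getitem__)
        tail_loss = statistics.fmean([losses[j] for j in order[-k:]])
        results.append((skew, theta, tail_loss, overall))
    results.sort(reverse=True)
    return results


# Synthetic illustration: a large majority cluster with low loss and a
# small, distant minority cluster with high loss (values are made up).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
minority = [(rng.gauss(6, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points = majority + minority
losses = [0.1] * len(majority) + [0.9] * len(minority)

best_skew, best_theta, best_tail_loss, overall_loss = scan_directions(points, losses)[0]
print(f"angle={math.degrees(best_theta):.0f} deg, skew={best_skew:.2f}, "
      f"tail loss={best_tail_loss:.2f}, overall loss={overall_loss:.2f}")
```

A direction is a candidate only if its tail loss clearly exceeds the overall loss; high skew alone indicates under-representation, not under-performance. The scan costs a sort per angle, and discretizing angles this way does not scale beyond very few dimensions.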
