Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups

Mohsen Dehghankar, Abolfazl Asudeh

TL;DR

This work addresses the problem of identifying unknown, under-represented, and under-performing minority groups when grouping attributes are not available. It proposes Minoria mining, a projection-based framework that seeks high-skew tails in feature space where model loss is elevated, using a dual-space geometric interpretation and median-region analysis. A 2D Ray-sweeping algorithm leverages the $n/2$-level of the dual arrangement to efficiently locate top skew directions, with higher-dimensional extensions offered via discretization and focused exploration heuristics. Empirical results on Chicago Crimes, College Admissions, and several higher-dimensional datasets demonstrate that the proposed method can uncover actionable minority regions that clustering baselines often miss, providing a practical tool for responsible data auditing and bias detection. Limitations include the linear-separability assumption and practical challenges in very high dimensions, motivating future work on nonlinear projections and alternative Minoria mining strategies.

Abstract

For a variety of reasons, such as privacy, data in the wild often lacks the grouping information required for identifying minorities. On the other hand, it is known that machine learning models are only as good as the data they are trained on and, hence, may underperform for under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists, who find themselves in an unknown-unknown situation: not only do they lack access to the grouping attributes, they also do not know what groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically speaking, we propose a geometric transformation of data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to higher dimensions suffers from the curse of dimensionality. Therefore, we propose a solution based on smart exploration of the search space for such cases. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. Our experimental results demonstrate the effectiveness of our proposed solutions in mining the unknown, under-represented, and under-performing minorities.

Paper Structure

This paper contains 48 sections, 2 theorems, 9 equations, 11 figures, 9 tables, 2 algorithms.

Key Result

Lemma 1

The ordering of points in $\mathcal{D}_f$ is the same as the reverse order of intersections of $\mathsf{d}(t_i)$ hyperplanes with $\vec{r_f}$. Hence, the dual-space transformation preserves the order of projected points.
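
The order reversal stated in Lemma 1 can be checked numerically on the toy dataset of Figure 2. The sketch below is only illustrative: it assumes the dual of a point $t$ is the hyperplane $\{x : t \cdot x = 1\}$ (this section does not spell out the transformation), and it picks a 45-degree direction $f$ for convenience, which is not necessarily the direction drawn in the figure. The names `projection` and `dual_intersection_distance` are hypothetical helpers.

```python
import math

# Toy dataset from Figure 2.
D = {"t1": (0.5, 1.5), "t2": (1.0, 0.75), "t3": (2.0, 1.0)}


def projection(t, f):
    """Scalar projection of point t onto the unit direction f."""
    return t[0] * f[0] + t[1] * f[1]


def dual_intersection_distance(t, f):
    """Distance from the origin at which the ray along unit vector f meets
    the ASSUMED dual hyperplane {x : t . x = 1}. Solving t . (s * f) = 1
    gives s = 1 / (t . f), valid while the projection is positive."""
    return 1.0 / projection(t, f)


# Illustrative direction in the first quadrant (an assumption of this sketch).
theta = math.pi / 4
f = (math.cos(theta), math.sin(theta))

# Points sorted by increasing projection in the primal space.
primal_order = sorted(D, key=lambda name: projection(D[name], f))
# Dual hyperplanes sorted by how early the ray from the origin hits them.
dual_order = sorted(D, key=lambda name: dual_intersection_distance(D[name], f))

print("primal order:", primal_order)
print("dual order:  ", dual_order)
```

Under these assumptions the two lists are exact reverses of each other, because a larger projection $t \cdot f$ gives a smaller intersection distance $1/(t \cdot f)$.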

Figures (11)

  • Figure 1: Toy example: Minoria mining using high-skew projections.
  • Figure 2: The toy dataset $\mathcal{D}=\{t_1\langle .5,1.5 \rangle, t_2\langle 1,.75 \rangle, t_3\langle 2,1 \rangle\}$ in the primal space, the dual space, and the 2nd level of the arrangement in the first quadrant. The order of projection in the primal-space panel is the reverse of the order in which the dual hyperplanes intersect $r_f$ ($[\mathsf{d}(t_3), \mathsf{d}(t_1), \mathsf{d}(t_2)]$) in the dual-space panel. In the level panel, the dotted blue lines mark the boundaries of the median regions (a change of line segment indicates a change in the median, i.e., the 2nd point, of $\mathcal{D}_f$).
  • Figure 3: A toy example for the Ray-sweeping algorithm. (a) and (b): Starting from $5$ points in the plane, the algorithm finds the $\frac{n}{2}$-th level of the arrangement of the dual lines. (c): Each median region is a line segment in this arrangement; the segment identifies the median point for every direction in that region. (d): The algorithm sweeps a ray from the $x$-axis to the $y$-axis, efficiently computes the skew value in each region, and pushes it onto a heap. At the end, the best directions in terms of skew are extracted from the heap.
  • Figure 4: Top high-skew directions on the Latitude and Longitude of Chicago Crimes dataset are presented in these two plots (the plots show a smaller sample of the dataset points). In the first plot, the output of the binary classification model is presented with green and blue lines. The red line shows the direction $f$ with the highest projection skew. The orange region shows the tail region.
  • Figure 5: The result of running the Ray-sweeping algorithm on the College-admission dataset.
  • ...and 6 more figures
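
The sweep-and-heap idea in the Figure 3 caption can be approximated with a brute-force stand-in. The sketch below is not the Ray-sweeping algorithm: it does not build the $\frac{n}{2}$-level or the median regions, and simply samples directions between the $x$-axis and the $y$-axis. It also assumes Pearson's median skewness, $3(\text{mean} - \text{median})/\text{std}$, as the skew measure and ranks directions by its absolute value; neither choice is specified in this section. The names `median_skew` and `top_skew_directions` are hypothetical.

```python
import heapq
import math
import statistics


def median_skew(values):
    """Pearson's median skewness, 3 * (mean - median) / std.
    ASSUMPTION: used here as the skew measure for illustration only."""
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return 3.0 * (statistics.mean(values) - statistics.median(values)) / std


def top_skew_directions(points, k=3, n_angles=91):
    """Brute-force stand-in for the sweep: sample n_angles (>= 2) directions
    in [0, pi/2], score each by |skew| of the projected points, and return
    the k best as (|skew|, theta, signed skew) tuples, best first."""
    scored = []
    for i in range(n_angles):
        theta = (math.pi / 2) * i / (n_angles - 1)
        fx, fy = math.cos(theta), math.sin(theta)
        projected = [x * fx + y * fy for x, y in points]
        s = median_skew(projected)
        scored.append((abs(s), theta, s))
    # heapq.nlargest plays the role of extracting the best directions from the heap.
    return heapq.nlargest(k, scored)


# Hypothetical 2D points: a compact majority plus one far-away point.
sample = [(0.0, 0.0), (0.2, 1.0), (0.1, -1.0), (-0.1, 0.5), (0.3, -0.4), (5.0, 0.1)]
for score, theta, signed in top_skew_directions(sample, k=3):
    print(f"theta={math.degrees(theta):5.1f} deg  skew={signed:+.3f}")
```

Sampling costs time proportional to the number of angles times the number of points and can miss directions between samples, which is the gap an exact sweep over median regions is designed to close.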

Theorems & Definitions (4)

  • Example 1
  • Lemma 1
  • Definition 1: Median Region
  • Theorem 2