Theoretical Analysis of Submodular Information Measures for Targeted Data Subset Selection

Nathan Beck, Truong Pham, Rishabh Iyer

TL;DR

By deriving similarity-based bounds on quantities related to relevance and coverage of the targeted data, this work shows that the SMI functions are theoretically sound in achieving good query relevance and query coverage.

Abstract

With the increasing volume of data being used across machine learning tasks, the capability to target specific subsets of data becomes more important. To aid in this capability, the recently proposed Submodular Mutual Information (SMI) has been effectively applied across numerous tasks in the literature to perform targeted subset selection with the aid of an exemplar query set. However, none of these works provides theoretical guarantees for SMI in terms of its sensitivity to a subset's relevance and coverage of the targeted data. For the first time, we provide such guarantees by deriving similarity-based bounds on quantities related to relevance and coverage of the targeted data. With these bounds, we show that the SMI functions, which have empirically shown success in multiple applications, are theoretically sound in achieving good query relevance and query coverage.

Paper Structure (18 sections, 16 theorems, 30 equations, 5 figures, 6 tables)

This paper contains 18 sections, 16 theorems, 30 equations, 5 figures, 6 tables.

Key Result

Theorem 1

Let $A$ contain at least one targeted instance ($\chi \geq 1$). Using the notation of Table \ref{tab:variables}, the Facility Location Mutual Information (FLVMI) enjoys the following bounds on $\chi$:

Figures (5)

  • Figure 1: Illustration of the concepts of query relevance and query coverage. The top of the figure illustrates a scenario where $\mathcal{A}$ is relevant to the queries (X's) in the right cluster of ${\mathcal{T}}$ but does not adequately cover the queries in the left cluster of ${\mathcal{T}}$. The bottom of the figure illustrates a scenario where $\mathcal{A}$ covers all queries in both clusters.
  • Figure 3: Behavior of the relevance bounds derived in Section \ref{sec:relevance}. The synthetic datasets are generated by randomly sampling from different Gaussian distributions. Untargeted clusters are colored orange, targeted clusters are colored blue, and query instances are plotted as red X's. Using these dataset configurations, random subsets of cardinality $5$ are drawn with a uniform marginal distribution with respect to $\chi$, and the $I_F(A;Q)$ value is plotted against $\chi$ (shown as blue points). The lower and upper bounds for each SMI function derived in Section \ref{sec:relevance} are plotted in each subfigure and are clipped to be less than the budget (5) and greater than 0.
  • Figure 4: Effect of $\eta$ on the correlation between FLVMI's objective value and $\chi$ on the two-target dataset.
  • Figure 7: Behavior of the coverage bounds derived in Section \ref{sec:coverage}. The synthetic datasets are generated by randomly sampling from different Gaussian distributions. Untargeted clusters are colored orange, targeted clusters are colored blue, and query instances are plotted as red X's. Using these dataset configurations, random subsets of cardinality $5$ are drawn with a uniform marginal distribution with respect to $\chi$, and the $I_F(A;Q)$ value is plotted against $\delta_{\text{avg}}^S$ (shown as blue points). The lower and upper bounds for each SMI function derived in Section \ref{sec:coverage} are plotted in each figure. Further, we apply a trivial clipping of the bounds, as $\delta_{\text{avg}}^S$ must lie between 0 and 1.
  • Figure 8: Effect of $\eta$ on the correlation between FLVMI's objective value and $\delta_{\text{avg}}^{{\mathcal{T}}\setminus A}$ on the two-target dataset.
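
The synthetic setup described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It makes several assumptions that this summary does not state: the FLVMI objective is taken in the form commonly given in the SMI literature, $I_F(A;Q) = \sum_{i \in V} \min(\max_{j \in A} s_{ij}, \eta \max_{q \in Q} s_{iq})$; the similarity is an RBF kernel; and the cluster locations, sizes, and helper names (`rbf_similarity`, `flvmi`, `score_subset`) are hypothetical. One subset of cardinality 5 is drawn for each value of $\chi$ from 0 to 5.

```python
import numpy as np

def rbf_similarity(X, Y, gamma=1.0):
    """Pairwise RBF similarity in (0, 1]; the kernel choice is an assumption."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def flvmi(S_VA, S_VQ, eta=1.0):
    """Assumed FLVMI form: sum_i min(max_{j in A} s_ij, eta * max_{q in Q} s_iq)."""
    return float(np.minimum(S_VA.max(axis=1), eta * S_VQ.max(axis=1)).sum())

rng = np.random.default_rng(0)

# Hypothetical dataset: one targeted and one untargeted Gaussian cluster.
n_per_cluster = 50
targeted = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(n_per_cluster, 2))
untargeted = rng.normal(loc=[4.0, 4.0], scale=0.3, size=(n_per_cluster, 2))
V = np.vstack([targeted, untargeted])
is_targeted = np.arange(len(V)) < n_per_cluster

# Query set sampled from the targeted distribution.
Q = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(5, 2))
S_VQ = rbf_similarity(V, Q)

def score_subset(idx, eta=1.0):
    """Return (chi, FLVMI value), where chi counts targeted instances in the subset."""
    chi = int(is_targeted[idx].sum())
    S_VA = rbf_similarity(V, V[idx])
    return chi, flvmi(S_VA, S_VQ, eta)

budget = 5
results = []
for chi_target in range(budget + 1):
    # Pick chi_target targeted points and fill the rest with untargeted ones.
    t_idx = rng.permutation(n_per_cluster)[:chi_target]
    u_idx = n_per_cluster + rng.permutation(n_per_cluster)[: budget - chi_target]
    idx = np.concatenate([t_idx, u_idx])
    results.append(score_subset(idx))

for chi, value in results:
    print(f"chi={chi}  FLVMI={value:.3f}")
```

Under these assumptions, a subset with no targeted instances scores close to zero because every ground-set point is far from either the subset or the query set, which is the kind of dependence on $\chi$ that the relevance bounds are meant to capture.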

Theorems & Definitions (16)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Theorem \ref{thm:flvmi_rel}
  • Theorem \ref{thm:flqmi_rel}
  • ...and 6 more