$\textit{GeoHard}$: Towards Measuring Class-wise Hardness through Modelling Class Semantics

Fengyu Cai, Xinran Zhao, Hongming Zhang, Iryna Gurevych, Heinz Koeppl

TL;DR

This work formally introduces the concept of $\textit{class-wise hardness}$ and proposes $\textit{GeoHard}$ for class-wise hardness measurement by modeling class geometry in the semantic embedding space.

Abstract

Recent advances in measuring hardness-wise properties of data guide language models in sample selection within low-resource scenarios. However, class-specific properties are overlooked in task setup and learning. How do these properties influence model learning, and do the effects generalize across datasets? To answer this question, this work formally introduces the concept of $\textit{class-wise hardness}$. Experiments across eight natural language understanding (NLU) datasets demonstrate a consistent hardness distribution across learning paradigms, models, and human judgment. Subsequent experiments reveal a notable challenge in measuring such class-wise hardness with the instance-level metrics of previous works. To address this, we propose $\textit{GeoHard}$ for class-wise hardness measurement by modeling class geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses instance-level metrics by over 59 percent in $\textit{Pearson}$'s correlation when measuring class-wise hardness. Our analysis theoretically and empirically underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data diagnosis. Additionally, we showcase how understanding class-wise hardness can practically aid in improving task learning.
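To make "class geometry in the semantic embedding space" concrete, the sketch below computes three simple per-class quantities from labelled embeddings: a centroid, an intra-class spread, and the mean distance to the other class centroids. This is an illustrative assumption about what such geometry could look like, not the paper's definition of $\textit{GeoHard}$. The function name `class_geometry`, the toy 2-D vectors, and the labels are all hypothetical.

```python
import math
from collections import defaultdict


def class_geometry(embeddings, labels):
    """Return (centroids, spread, separation), each a dict keyed by class label.

    spread[c]     : mean squared Euclidean distance of class-c points to their centroid
    separation[c] : mean Euclidean distance from the class-c centroid to the other centroids
    These are illustrative quantities, not the paper's GeoHard formula.
    """
    by_class = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        by_class[lab].append(vec)

    centroids, spread = {}, {}
    for lab, vecs in by_class.items():
        dim = len(vecs[0])
        centroid = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        centroids[lab] = centroid
        spread[lab] = sum(math.dist(v, centroid) ** 2 for v in vecs) / len(vecs)

    separation = {}
    for lab, centroid in centroids.items():
        dists = [math.dist(centroid, other) for k, other in centroids.items() if k != lab]
        separation[lab] = sum(dists) / len(dists) if dists else 0.0
    return centroids, spread, separation


# Toy, made-up 2-D "embeddings": NEU sits between ENT and CON and is more spread out.
toy_embeddings = [(-2.0, 0.1), (-2.0, -0.1), (0.0, 1.0), (0.0, -1.0), (2.0, 0.1), (2.0, -0.1)]
toy_labels = ["ENT", "ENT", "NEU", "NEU", "CON", "CON"]
centroids, spread, separation = class_geometry(toy_embeddings, toy_labels)
```

In this toy setup the middle class has both the largest spread and the smallest average distance to the other centroids, which is the kind of pattern the abstract associates with a harder class.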

Paper Structure

This paper contains 54 sections, 1 theorem, 9 equations, 10 figures, 21 tables, and 1 algorithm.

Key Result

Theorem 1

Assuming a Gaussian distribution for instances within $c_k$, $D \sim \mathcal{N}(\mu_{c_k}, \sigma_{c_k}^2)$, the means of the training and test data can be represented as $\hat{\mu}_{c_k}^{tr} \sim \mathcal{N}(\mu_{c_k}, \sigma_{c_k}^2 / n_{tr})$ and $\hat{\mu}_{c_k}^{te} \sim \mathcal{N}(\mu_{c_k}, \ldots)$
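The $\sigma_{c_k}^2 / n_{tr}$ term in the theorem's premise is the standard variance of a sample mean of $n_{tr}$ independent Gaussian draws. A small simulation can check this numerically. The sketch below is only a sanity check of that premise under a one-dimensional assumption; the function name `sample_mean_variance` and the chosen values of `mu`, `sigma`, and `n` are hypothetical and not taken from the paper.

```python
import random
import statistics


def sample_mean_variance(mu, sigma, n, trials=5000, seed=0):
    """Empirical variance of the mean of n draws from N(mu, sigma^2).

    Repeats the experiment `trials` times with a fixed seed and returns the
    population variance of the resulting sample means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return statistics.pvariance(means)


# Hypothetical class with mu = 0 and sigma = 2; theory predicts sigma^2 / n.
sigma = 2.0
var_small_n = sample_mean_variance(0.0, sigma, n=25)    # theory: 4 / 25  = 0.16
var_large_n = sample_mean_variance(0.0, sigma, n=100)   # theory: 4 / 100 = 0.04
```

The estimate shrinks as `n` grows, so a class with a large variance and few samples has a noisier estimated mean, and its training and test means can sit further apart.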

Figures (10)

  • Figure 1: Examples of premise-hypothesis pairs in uncertain NLI ($u$-NLI; Chen et al., 2020). In $u$-NLI, the probability of each pair (in parentheses) is annotated by crowdworkers. The examples showcase NEU's Middlemost and Diverse semantics, i.e., it is positioned in the middle between ENT and CON and ranges widely from low (14%) to high probability (84%).
  • Figure 2: Correlation matrix among the class-wise F1 scores of three fine-tuned models and two ICLs, together with class-wise human disagreement, on SNLI; high consistency is observed. Figure \ref{fig:mnli_corr} in Appendix \ref{appendix:hardness:llms} presents MNLI's correlation matrix.
  • Figure 3: An illustration of GeoHard in the semantic embedding space. The ellipses approximate the class-wise data distributions. Class 2 is speculated to be difficult due to its large variance and middlemost location.
  • Figure 4: The ratio between F1 scores on the test and training sets over training epochs on NLI tasks. Neutral (in blue) suffers most from overfitting. Figure \ref{fig:sc_dist_diff} in Appendix \ref{appendix:general:figure} presents a similar issue in SC tasks.
  • Figure 5: Average Pearson's coefficient between various metrics and the hardness reference on five SC tasks. GeoHard with different embeddings consistently and significantly outperforms instance-level aggregation, demonstrating the robustness of GeoHard.
  • ...and 5 more figures
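
The comparisons in the abstract and in Figure 5 rely on Pearson's correlation coefficient between a metric's per-class values and a hardness reference. As a minimal sketch of that calculation, the helper below computes Pearson's $r$ from scratch. The numbers are made up for illustration and are not results from the paper. Using class-wise F1 as the reference, and expecting a negative correlation with it, are assumptions of this example.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-class values for a three-class task (not from the paper).
class_f1 = [0.90, 0.72, 0.88]        # reference: lower F1 is read as a harder class
metric_scores = [0.20, 0.65, 0.35]   # some hardness metric: higher is read as harder
r = pearson(metric_scores, class_f1)
```

Under these assumptions, a metric that tracks class-wise hardness well gives an `r` close to -1, since the class it scores as hardest is also the one with the lowest F1.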

Theorems & Definitions (1)

  • Theorem 1