$\textit{GeoHard}$: Towards Measuring Class-wise Hardness through Modelling Class Semantics

Fengyu Cai; Xinran Zhao; Hongming Zhang; Iryna Gurevych; Heinz Koeppl

TL;DR

This work formally initiates the concept of $\textit{class-wise hardness}$ and proposes $\textit{GeoHard}$, which measures class-wise hardness by modeling class geometry in the semantic embedding space.

Abstract

Recent advances in measuring hardness-wise properties of data guide language models in sample selection within low-resource scenarios. However, class-specific properties are overlooked in task setup and learning. How do these properties influence model learning, and do they generalize across datasets? To answer this question, this work formally initiates the concept of $\textit{class-wise hardness}$. Experiments across eight natural language understanding (NLU) datasets demonstrate a consistent hardness distribution across learning paradigms, models, and human judgment. Subsequent experiments reveal that the instance-level metrics of previous works struggle to measure such class-wise hardness. To address this, we propose $\textit{GeoHard}$ for class-wise hardness measurement by modeling class geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses instance-level metrics by over 59 percent in $\textit{Pearson}$'s correlation when measuring class-wise hardness. Our analysis theoretically and empirically underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data diagnosis. Additionally, we showcase how understanding class-wise hardness can practically aid in improving task learning.
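To make "modeling class geometry in the semantic embedding space" concrete, the following is a minimal sketch rather than the authors' GeoHard formula, which this summary does not state. It assumes per-instance embeddings are already available and computes two illustrative quantities per class: an intra-class dispersion and a mean distance from the class centroid to the other centroids (a small value suggests a "middlemost" class). The function names, the toy 2-D data, and the choice of these two quantities are all assumptions made for illustration.

```python
import numpy as np


def class_geometry(embeddings, labels):
    """Per-class intra-class dispersion and mean centroid distance (illustrative only)."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # Intra-class dispersion: mean squared distance of instances to their own centroid.
    intra = np.array([
        np.mean(np.sum((embeddings[labels == c] - centroids[i]) ** 2, axis=1))
        for i, c in enumerate(classes)
    ])
    # Mean distance from each centroid to the other centroids; small means "middlemost".
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    inter = pairwise.sum(axis=1) / (len(classes) - 1)
    return classes, intra, inter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def toy_embeddings(seed=0, n=200):
    """Synthetic 2-D 'embeddings': class 1 sits in the middle and is more spread out."""
    rng = np.random.default_rng(seed)
    centers = np.array([[-3.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
    scales = [0.5, 1.5, 0.5]
    X = np.concatenate([rng.normal(c, s, size=(n, 2)) for c, s in zip(centers, scales)])
    y = np.repeat(np.arange(3), n)
    return X, y


X, y = toy_embeddings()
classes, intra, inter = class_geometry(X, y)
for c, v, d in zip(classes, intra, inter):
    print(f"class {c}: intra-class dispersion={v:.2f}, mean centroid distance={d:.2f}")
```

On this synthetic data, the middle class has both the largest dispersion and the smallest mean centroid distance. With real data, `pearson` could be used to compare such a per-class score against a per-class hardness reference (for example, class-wise F1), which is the kind of evaluation the abstract reports.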

Paper Structure (54 sections, 1 theorem, 9 equations, 10 figures, 21 tables, 1 algorithm)

Introduction
Formulation of Class-wise Hardness
Datasets
Calculation of Empirical Hardness $\tilde{\mathrm{H}}$
Inter-annotator disagreement
Fine-tuning
In-context Learning
GeoHard for class-wise hardness measurement
Notations
GeoHard
Semantic representation
Semantics-guided metrics
Implementation
Experiments
Baseline: Instance Hardness Aggregation
...and 39 more sections

Key Result

Theorem 1

Assuming a Gaussian distribution for instances within $c_k$, $D \sim \mathcal{N}(\mu_{c_k}, \sigma_{c_k}^2)$, the means of the training and test data can be represented as $\hat{\mu}_{c_k}^{tr} \sim \mathcal{N}(\mu_{c_k}, \sigma_{c_k}^2 / n_{tr})$ and $\hat{\mu}_{c_k}^{te} \sim \mathcal{N}(\mu_{c_k}
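The sampling-distribution claim for the training mean, $\hat{\mu}_{c_k}^{tr} \sim \mathcal{N}(\mu_{c_k}, \sigma_{c_k}^2 / n_{tr})$, can be checked numerically. The sketch below is a scalar (one-dimensional) simulation under the theorem's Gaussian assumption; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_sample_means(mu, sigma, n, trials=20000, seed=0):
    """Draw `trials` independent samples of size n from N(mu, sigma^2); return their means."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(mu, sigma, size=(trials, n))
    return draws.mean(axis=1)


# Hypothetical values standing in for mu_{c_k}, sigma_{c_k}, and n_tr.
mu, sigma, n_tr = 1.0, 2.0, 50
means = simulate_sample_means(mu, sigma, n_tr)

empirical_var = means.var()
theoretical_var = sigma**2 / n_tr  # sigma^2 / n_tr, as in the theorem
print(f"empirical variance of the sample mean:   {empirical_var:.4f}")
print(f"theoretical variance sigma^2 / n_tr:     {theoretical_var:.4f}")
```

The empirical variance of the simulated means should land close to $\sigma^2 / n_{tr}$, and it shrinks as the sample size grows.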

Figures (10)

Figure 1: Examples of premise-hypothesis pairs in uncertain NLI ($u$-NLI; Chen et al., 2020). In $u$-NLI, the probability of each pair (in parentheses) is annotated by crowdworkers. The examples showcase the Middlemost and Diverse semantics of NEU: it sits between ENT and CON and spans a wide range of probabilities, from low (14%) to high (84%).
Figure 2: Correlation matrix among the class-wise F1 scores of three fine-tuned models, two in-context learning (ICL) settings, and class-wise human disagreement on SNLI, showing high consistency. Figure \ref{fig:mnli_corr} in Appendix \ref{appendix:hardness:llms} presents the corresponding correlation matrix for MNLI.
Figure 3: Illustration of GeoHard in the semantic embedding space. The ellipses approximate the class-wise data distributions. Class 2 is speculated to be difficult due to its large variance and middlemost location.
Figure 4: The ratio between F1 scores on the test and training sets over training epochs on NLI tasks. Neutral (in blue) suffers most from overfitting. Figure \ref{fig:sc_dist_diff} in Appendix \ref{appendix:general:figure} presents a similar issue in SC tasks.
Figure 5: Average Pearson's coefficient between various metrics and the hardness reference on five SC tasks. GeoHard with different embeddings consistently and significantly outperforms instance-level aggregation, demonstrating the robustness of GeoHard.
...and 5 more figures

Theorems & Definitions (1)

Theorem 1
