Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability

Parker Seegmiller; Joseph Gatto; Sarah Masud Preum

TL;DR

Depth $F_1$ measures how well a model performs on target samples which are dissimilar from the source domain, to enable in-depth evaluation of the semantic generalizability of cross-domain text classification models.

Abstract

Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask when models fail to transfer learning to specific target samples which are highly dissimilar from the source domain. We introduce Depth $F_1$, a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as $F_1$, Depth $F_1$ measures how well a model performs on target samples which are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.
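To make the idea of re-weighting by dissimilarity concrete, here is a minimal sketch of a sample-weighted $F_1$ for a single class. It is an illustration under stated assumptions, not the paper's definition of Depth $F_1$: the function name `weighted_f1` and the example weights are hypothetical, and the weights stand in for whatever dissimilarity-based weights the metric assigns to target samples.

```python
from typing import Hashable, Sequence


def weighted_f1(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    weights: Sequence[float],
    positive: Hashable = 1,
) -> float:
    """F1 for one class, with each sample counted in proportion to its weight.

    Hypothetical helper for illustration; not the paper's exact formula.
    """
    tp = fp = fn = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if p == positive and t == positive:
            tp += w  # weighted true positive
        elif p == positive:
            fp += w  # weighted false positive
        elif t == positive:
            fn += w  # weighted false negative
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]

# Uniform weights reduce to the ordinary F1 for the positive class.
uniform = weighted_f1(y_true, y_pred, [1.0, 1.0, 1.0, 1.0])

# Assumed weights: the two misclassified samples are the most
# source-dissimilar ones, so their errors count more.
reweighted = weighted_f1(y_true, y_pred, [0.1, 1.0, 0.1, 1.0])
```

In this toy example the model is only correct on the low-weight (source-similar) samples, so the re-weighted score falls below the uniform one. That is the kind of gap a dissimilarity-aware metric is meant to expose.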

Paper Structure (26 sections, 9 equations, 4 figures, 3 tables)

Introduction
Related Works
Cross-Domain Text Classification Benchmarks
Measuring Distances Between Corpora
A Novel Evaluation Strategy for Cross-Domain Text Classification
Depth Weights $w_i$
Depth $F_1$ ($DF_1$)
$\lambda$ Dissimilarity
Benchmark Cross-Domain Text Classification Data
Benchmark Data
Data Investigation with TTE Depth
Scope of $F_1$ for Cross-Domain Text Classification
Experiments
Benchmarking Models Using $DF_1$
Examining Model Behavior
...and 11 more sections

Figures (4)

Figure 1: Depth $F_1$ ($DF_1$) is a cross-domain text classification metric designed to measure a model's semantic generalizability. Both kitchen appliance reviews have positive sentiments. Still, the one highlighted in red is more challenging to classify due to its dissimilarity to samples in the source domain cell phone reviews. $DF_1$ re-weights target samples by dissimilarity to the source domain, enabling a more in-depth evaluation of model performance. Detailed discussion of these examples can be found in Appendix \ref{app:sim_examples}.
Figure 2: TTE depth scores of source and target samples in SiS-2 and MuS-3 pairings, with respect to source samples. The separated nature of the SiS-2 source and target domains indicates that the two domains are highly semantically dissimilar. However, MuS-3 is more semantically overlapping (brown bars resulting from overlap in the distributions indicated by red and green). This discrepancy highlights that not all cross-domain text classification datasets present equally challenging tasks.
Figure 3: $F_1$ and Depth $F_1$ ($DF_1$) results of the two demonstration models $A$ and $B$ on the SiS-1 cross-domain text classification {source, target} pairing. While models $A$ and $B$ have nearly identical $F_1$ scores, they differ significantly in $DF_1$ scores, and that difference increases with $\lambda$ as more domain-similar target texts are removed from the target domain evaluation set.
Figure 4: Evaluation of cross-domain text classification models using $F_1$ and Depth $F_1$ ($DF_1$). We present both micro-average $F_1$ scores and micro-average $DF_1$ scores from $\lambda = 0$ to $\lambda = 90$, i.e., the percentage of most similar target texts that are not considered in the evaluation. Each result is averaged across two {source, target} domain pairings for both the sentiment analysis (SA) and natural language inference (NLI) tasks in both the single-source (SiS) and multi-source (MuS) scenarios. Model performance that decreases as $\lambda$ increases is indicated with solid lines, highlighting overfitting on source-similar texts. We give these results, along with corresponding $F_1$ scores, in tabular format in Tables \ref{tab:experiments_sa} and \ref{tab:experiments_nli} of Appendix \ref{app:investigate}.
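
The role of $\lambda$ described in the Figure 4 caption, the percentage of most similar target texts left out of the evaluation, can be sketched as a simple filtering step. This is a rough illustration only: the helper name `keep_dissimilar` and the similarity scores are hypothetical, and the sketch assumes a higher score means a target text is more similar to the source domain.

```python
from typing import List, Sequence


def keep_dissimilar(similarity: Sequence[float], lam: float) -> List[int]:
    """Return indices of target samples kept after dropping the `lam` percent
    that are most similar to the source domain.

    Hypothetical helper; assumes higher score = more source-similar.
    """
    if not 0 <= lam < 100:
        raise ValueError("lam must be in [0, 100)")
    n = len(similarity)
    n_drop = int(n * lam / 100)  # number of most-similar samples to exclude
    # Sort sample indices from most to least similar to the source domain.
    order = sorted(range(n), key=lambda i: similarity[i], reverse=True)
    return sorted(order[n_drop:])


# Made-up similarity scores for four target texts.
scores = [0.9, 0.1, 0.5, 0.7]

kept_all = keep_dissimilar(scores, 0)    # lambda = 0 keeps every sample
kept_half = keep_dissimilar(scores, 50)  # lambda = 50 drops the two most similar
```

A classification metric would then be computed on the retained indices only, so larger values of $\lambda$ restrict the evaluation to progressively more source-dissimilar target texts.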
