Semantic similarity prediction is better than other semantic similarity measures

Steffen Herbold

TL;DR

The paper examines how to quantify semantic similarity between text pairs and argues that directly predicting similarity with a fine-tuned transformer regression model (STSScorer) aligns better with human judgments than n-gram-based or embedding-based metrics. STSScorer is implemented by fine-tuning RoBERTa on STS-B with an $MSE$ loss; the regression output is divided by 5 ($logits/5$) to map scores to the interval $[0,1]$. The method is evaluated against BLEU, BERTScore, and S-BERT on STS-B, MRPC, QQP, and WMT22-ZH-EN. On STS-B, MRPC, and QQP, STSScorer shows stronger linear or rank correlations with human labels and higher AUCs, while BERTScore and BLEU underperform in several settings. On WMT22-ZH-EN, BERTScore sometimes aligns better with MQM judgments, which highlights the limits of a single semantic view. The authors discuss biases, data-coverage limitations, and the potential benefits of multi-faceted or multilingual similarity measures, and conclude that fine-tuned regression-based similarity prediction is a practical and effective tool for semantic evaluation in NLP tasks.
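The rescaling and training loss mentioned in the summary can be illustrated with a small sketch. It assumes raw regression outputs on the STS-B label scale of 0 to 5; the function names, the optional clipping step, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence

STS_B_MAX = 5.0  # STS-B similarity labels lie in the range [0, 5]


def normalize_score(raw: float, clip: bool = True) -> float:
    """Rescale a raw regression output from the [0, 5] scale to [0, 1]."""
    score = raw / STS_B_MAX
    if clip:
        # Assumption: clipping is our own addition. A regression head can
        # slightly overshoot the label range, so division alone does not
        # strictly guarantee a value inside [0, 1].
        score = min(1.0, max(0.0, score))
    return score


def mse(predictions: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predicted and gold similarity scores."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Hypothetical raw model outputs and gold labels on the 0-5 scale.
    raw_outputs = [4.8, 2.5, 0.3]
    gold_labels = [5.0, 2.0, 0.0]
    print([normalize_score(r) for r in raw_outputs])
    print(mse(raw_outputs, gold_labels))
```

Dividing by a constant keeps the ordering of the scores unchanged, so rank-based evaluations are unaffected by the rescaling; only the clipping step, if enabled, can alter values at the extremes.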

Abstract

Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). In this paper, we argue that when we are only interested in measuring semantic similarity, it is better to directly predict the similarity using a model fine-tuned for this task. Using a model fine-tuned for the Semantic Textual Similarity Benchmark task (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.

Paper Structure (16 sections, 13 figures, 4 tables)

Introduction
Method
STSScorer
Analysis approach
STS-B data
MRPC and QQP data
WMT22-ZH-EN
Tools used
Results
Discussion
Limitations
Conclusion
Results of additional experiments
Results when using BERT-base for STSScorer
Results when using an ensemble for scoring
...and 1 more section

Figures (13)

Figure 1: Evaluation of similarity measures on the test data of STS-B. Ideally, the similarity correlates linearly with the labels, i.e., the scores are close to the black line.
Figure 2: Evaluation of similarity measures on the test data of MRPC. Ideally, the positive class (1) has scores close to one and the negative class (0) has smaller values, but not close to zero.
Figure 3: Evaluation of similarity measures on the training data of QQP. Ideally, the positive class (1) has scores close to one and the negative class (0) has smaller values, with only a small fraction being close to zero.
Figure 4: Receiver Operating Characteristic (ROC) curves and AUC measurements for using the semantic similarity metrics as classifiers for the MRPC and QQP data. A larger area is better.
Figure 5: Evaluation of similarity measures on the MQM-labelled test data for translations from Chinese to English (ZH-EN) from the WMT-22 metrics challenge. Ideally, the similarity correlates linearly with the MQM labels, i.e., the scores are close to the black line.
...and 8 more figures
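
The captions above describe two evaluation views: linear correlation between similarity scores and graded labels (STS-B, WMT22-ZH-EN), and AUC when a similarity score is used as a classifier for binary paraphrase labels (MRPC, QQP). The sketch below shows one way to compute both in plain Python. It is not the paper's evaluation code; the function names and the toy data are hypothetical, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between similarity scores and graded labels."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive pair outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical similarity scores in [0, 1] with graded labels (0-5)
    # and with binary paraphrase labels.
    print(pearson([0.1, 0.5, 0.9], [0.0, 3.0, 5.0]))
    print(roc_auc([0.9, 0.8, 0.4, 0.7], [1, 1, 0, 0]))
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration; for a dataset the size of QQP a sorted, rank-based implementation or a library routine would be the more practical choice.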
