Semantic similarity prediction is better than other semantic similarity measures

Steffen Herbold

TL;DR

The paper examines how to quantify semantic similarity between text pairs, arguing that a regression-based approach using a fine-tuned transformer (STSScorer) aligns better with human judgments than traditional embedding- or n-gram-based metrics. It implements STSScorer by fine-tuning RoBERTa on STS-B with MSE loss and normalizing the predictions to the interval $[0,1]$ by dividing the logits by 5 ($\text{logits}/5$), and then evaluates this method against BLEU, BERTScore, and S-BERT on STS-B, MRPC, QQP, and WMT22-ZH-EN. On STS-B, MRPC, and QQP, STSScorer shows stronger linear or rank correlations with human labels and higher AUCs, while BERTScore and BLEU underperform in several settings; on WMT22-ZH-EN, BERTScore sometimes aligns better with MQM judgments, highlighting the limits of a single semantic view. The authors discuss biases, data-coverage limitations, and the potential benefits of multi-faceted or multi-language similarity measures, and suggest that fine-tuned regression-based similarity measures are a practical and effective tool for semantic evaluation in NLP tasks.

Abstract

Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
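
The normalization step mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the fine-tuned regression model emits a raw prediction on the STS-B label scale of 0 to 5; the function names `normalize_score` and `mse_loss` are hypothetical and not taken from the paper's code, and no model is loaded here.

```python
from typing import Sequence

# Assumption: STS-B similarity labels range from 0 (unrelated) to 5 (equivalent).
STS_B_MAX_LABEL = 5.0


def mse_loss(predictions: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between raw predictions and labels on the 0-5 scale."""
    if len(predictions) != len(labels) or len(predictions) == 0:
        raise ValueError("predictions and labels must be non-empty and equally long")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)


def normalize_score(raw_prediction: float) -> float:
    """Map a raw 0-5 prediction to a similarity score by dividing by 5."""
    return raw_prediction / STS_B_MAX_LABEL


if __name__ == "__main__":
    # Hypothetical raw model outputs for three sentence pairs.
    raw_outputs = [5.0, 2.5, 0.0]
    print([normalize_score(r) for r in raw_outputs])
    print(mse_loss([1.0, 3.0], [1.0, 5.0]))
```

Because a regression head is not bounded, a raw prediction can fall slightly outside 0 to 5; whether to clip the normalized value is a design choice this sketch leaves open.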

Paper Structure (16 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Evaluation of similarity measures on the test data of STS-B. Ideally, the similarity correlates linearly with the labels, i.e., the scores are close to the black line.
  • Figure 2: Evaluation of similarity measures on the test data of MRPC. Ideally, the positive class (1) has scores close to one and the negative class (0) has smaller values, but not close to zero.
  • Figure 3: Evaluation of similarity measures on the training data of QQP. Ideally, the positive class (1) has scores close to one and the negative class (0) has smaller values, with only a small fraction being close to zero.
  • Figure 4: Receiver Operating Characteristic (ROC) curves and AUC measurements for using the semantic similarity metrics as classifiers on the MRPC and QQP data. A larger area is better.
  • Figure 5: Evaluation of similarity measures on the MQM-labelled test data for translations from Chinese to English (ZH-EN) from the WMT-22 metrics challenge. Ideally, the similarity correlates linearly with the MQM labels, i.e., the scores are close to the black line.
  • ...and 8 more figures
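
The Figure 4 caption describes using similarity scores directly as binary classifiers and summarizing them by AUC. As a rough illustration, the sketch below computes AUC through its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the sample scores are hypothetical; this is not the paper's evaluation code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive pair outscores a negative pair."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical similarity scores for two paraphrase pairs (label 1)
    # and two non-paraphrase pairs (label 0).
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The double loop is quadratic in the number of examples, which is fine for illustration; for data sets the size of QQP a sort-based computation would be the practical choice.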