Semantic similarity prediction is better than other semantic similarity measures
Steffen Herbold
TL;DR
The paper examines how to quantify semantic similarity between text pairs and argues that directly predicting similarity with a fine-tuned transformer (STSScorer) aligns better with human judgments than traditional embedding-based or n-gram-based metrics. STSScorer is built by fine-tuning RoBERTa on STS-B with a mean squared error (MSE) loss and dividing the model output by 5 to map it to the interval $[0,1]$. The method is evaluated against BLEU, BERTScore, and S-BERT on STS-B, MRPC, QQP, and WMT22-ZH-EN. On STS-B, MRPC, and QQP, STSScorer shows stronger linear or rank correlations with human labels and higher AUCs, while BERTScore and BLEU underperform in several settings. On WMT22-ZH-EN, BERTScore sometimes aligns better with MQM judgments, which highlights the limits of a single semantic view. The authors discuss biases, data-coverage limitations, and the potential benefits of multi-faceted or multi-language similarity measures, and conclude that fine-tuned regression-based similarity measures are a practical and effective tool for semantic evaluation in NLP tasks.
Abstract
Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). In this paper, we argue that when we are only interested in measuring semantic similarity, it is better to predict the similarity directly using a model fine-tuned for this task. Using a model fine-tuned for the Semantic Textual Similarity Benchmark task (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
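To make the scoring step concrete, the following minimal sketch shows how a raw regression output on the STS-B label scale (0 to 5) could be mapped to $[0,1]$ by dividing by 5, together with an MSE computation of the kind used as the training loss. It is an illustration under stated assumptions, not the paper's implementation: the function names `normalize_score` and `mse`, the optional clipping of out-of-range outputs, the choice to compute the loss on the original 0 to 5 scale, and the example numbers are all hypothetical.

```python
from typing import Sequence

STS_B_MAX = 5.0  # STS-B human similarity labels range from 0 to 5


def normalize_score(raw_output: float, clip: bool = True) -> float:
    """Map a raw regression output on the STS-B scale to [0, 1]."""
    score = raw_output / STS_B_MAX
    if clip:
        # Assumption: a regression head can overshoot the label range,
        # so out-of-range values are clamped. The source only states
        # the division by 5.
        score = min(1.0, max(0.0, score))
    return score


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predictions and human labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if len(predictions) == 0:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Hypothetical raw model outputs and human labels, for illustration only.
    raw_outputs = [4.6, 0.3, 5.2]
    human_labels = [5.0, 0.0, 5.0]

    print([normalize_score(r) for r in raw_outputs])
    print(mse(raw_outputs, human_labels))
```

In practice the raw outputs would come from the fine-tuned RoBERTa regression head; they are hard-coded here so that the sketch runs without any model download.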
