SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

Michelle Wastl, Jannis Vamvas, Rico Sennrich

TL;DR

SwissGov-RSD introduces the first human-annotated, cross-lingual, document-level benchmark for token-level semantic difference recognition, enabling realistic evaluation beyond synthetic data. The study benchmarks a spectrum of approaches from unsupervised baselines to few-shot prompting and fine-tuning, including a cross-lingual label-projection method, across English-German, English-French, and English-Italian. Major findings show a substantial drop in performance when moving from synthetic iSTS-RSD data to SwissGov-RSD, with encoders generally competitive and LLMs often constrained by length, formatting, and cross-lingual transfer challenges. The work emphasizes the need for methods that better align with human judgments on real-world multilingual documents and provides publicly available data and baselines to drive further research.

Abstract

Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.

Paper Structure

This paper contains 46 sections, 2 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: Excerpt from an English-German document pair from the SwissGov-RSD dataset, annotated with token-level differences. The differences that we found range from explicitations to omitted paragraphs. The paragraph marked in deep red contains information about emergency calls and is completely omitted in the English document.
  • Figure 2: Architectures used in our experiments. (a) In the unsupervised approach, each document is encoded separately and a difference alignment algorithm is used to predict difference scores for each token. (b) LLMs are given a natural language instruction and examples of the expected output; the two text segments to compare are provided in tokenized form, and the LLM autoregressively generates a JSON object with a score for each token. (c) The token regressor predicts a score for each encoded token in the sequence pair.
  • Figure 3: Label distribution of the final SwissGov-RSD dataset in tokens (separated by white spaces). 0-labeled tokens are not considered in this plot.
  • Figure 4: Excerpt of an EN-DE document pair with gold labels and predictions of one model from each of the system categories listed in Table \ref{tab:main-results}.
  • Figure 5: Average Spearman correlation coefficient at different positions in the documents, averaged across language pairs. $n$ indicates the number of documents. Dotted lines represent DiffAlign approaches, the dashed line the fine-tuned regressor, and solid lines LLMs.
  • ...and 12 more figures
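
The captions of Figures 2 and 5 mention two mechanics that a small example may make more concrete: an LLM emitting a JSON object with one score per token, and predictions being compared with gold labels via Spearman correlation. The following is a minimal sketch under stated assumptions, not the authors' implementation. The JSON shape (a list of `{"token", "score"}` objects), the helper names `parse_token_scores`, `average_ranks`, and `spearman`, and the fallback score of 0.0 for missing or malformed entries are all hypothetical choices made for illustration.

```python
import json
from statistics import mean


def parse_token_scores(raw_json, tokens, default=0.0):
    """Align an LLM reply with the input tokens.

    Assumed (hypothetical) reply format: a JSON list of objects such as
    {"token": "cat", "score": 0.8}, in the same order as the input tokens.
    Anything missing, mismatched, or malformed falls back to `default`.
    """
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError:
        items = []
    if not isinstance(items, list):
        items = []
    scores = []
    for i, tok in enumerate(tokens):
        item = items[i] if i < len(items) else None
        ok = (
            isinstance(item, dict)
            and item.get("token") == tok
            and isinstance(item.get("score"), (int, float))
        )
        scores.append(float(item["score"]) if ok else default)
    return scores


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Toy example with made-up tokens, gold labels, and an incomplete LLM reply.
tokens = ["The", "cat", "sleeps"]
gold = [0.0, 1.0, 0.0]
reply = '[{"token": "The", "score": 0.0}, {"token": "cat", "score": 0.8}]'
pred = parse_token_scores(reply, tokens)
rho = spearman(gold, pred)
```

Falling back to a default score keeps the prediction vector the same length as the token sequence even when the generated JSON is truncated or malformed, which is one plausible way to handle the formatting problems the TL;DR attributes to LLMs. Because token-level labels contain many ties (most tokens share the same label), the rank computation averages tied ranks rather than using the simplified no-ties formula.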