Findings of the WMT 2024 Shared Task on Chat Translation

Wafaa Mohammed, Sweta Agrawal, M. Amin Farajian, Vera Cabarrão, Bryan Eikema, Ana C. Farinha, José G. C. de Souza

TL;DR

The analysis shows that while the systems excelled at translating individual turns, there is room for improvement in overall conversation-level translation quality.

Abstract

This paper presents the findings from the third edition of the Chat Translation Shared Task. As with previous editions, the task involved translating bilingual customer support conversations, specifically focusing on the impact of conversation context on translation quality and evaluation. This edition adds two new language pairs, English-Korean and English-Dutch, to the set of language pairs from previous editions: English-German, English-French, and English-Brazilian Portuguese. We received 22 primary submissions and 32 contrastive submissions from eight teams, with each language pair having participation from at least three teams. We evaluated the systems comprehensively using both automatic metrics and human judgments via a direct assessment framework. The official rankings for each language pair were determined based on human evaluation scores, considering performance in both translation directions (agent and customer). Our analysis shows that while the systems excelled at translating individual turns, there is room for improvement in overall conversation-level translation quality.

Paper Structure

This paper contains 39 sections, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Screen capture of the Appraise interface used by professional linguists to perform human evaluation.
  • Figure 2: MuDA F1 scores across all settings.
  • Figure 3: Conversation-level DA scores.
  • Figure 4: Turn-level DA score across different language pairs through a chat.
  • Figure 5: Turn avg. vs conversation-level DA scores.