Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation

Dayeon Ki, Kevin Duh, Marine Carpuat

TL;DR

This study investigates how different forms of quality feedback influence monolingual users' willingness to share machine-translated content. It contrasts explicit feedback (error highlights, LLM explanations) with implicit feedback (backtranslation, QA tables) in a COVID-19 information scenario, using decision accuracy and confidence-weighted accuracy as key metrics. The findings show that implicit QA-table feedback yields the strongest improvements in both accuracy and appropriate reliance, while error highlights underperform. The work highlights the value of feedback that encourages users to judge translations themselves rather than prescribing a course of action, informing the design of MT quality feedback in real-world, user-centric contexts.

Abstract

As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using (1) error highlights and (2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through (3) backtranslation and (4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.
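
To make the two headline measures concrete, the following is a minimal sketch of how decision accuracy and reliance categories could be computed from share / do-not-share decisions. The function names and the exact category definitions are assumptions made for illustration: here a switch is "appropriate" when the AI-assisted decision matches the gold shareability label, "over-reliance" when the user switches to sharing a translation that should not be shared, and "under-reliance" when the user switches to not sharing a translation that could be shared. The paper's own definitions may differ in detail.

```python
def decision_accuracy(decisions, gold):
    """Fraction of share / do-not-share decisions that match the gold label.

    Both arguments are lists of booleans where True means "share".
    """
    if len(decisions) != len(gold) or not gold:
        raise ValueError("decisions and gold must be non-empty and equal in length")
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)


def classify_switch(independent, assisted, gold):
    """Label one item by how the decision changed once feedback was shown.

    Assumed definitions (illustrative, not taken from the paper):
      no_switch      - the decision did not change
      appropriate    - the decision changed and is now correct
      over_reliance  - switched to sharing an unshareable translation
      under_reliance - switched to not sharing a shareable translation
    """
    if independent == assisted:
        return "no_switch"
    if assisted == gold:
        return "appropriate"
    return "over_reliance" if assisted else "under_reliance"


# Toy data, invented for this example only.
gold = [True, False, True, False]
independent = [True, True, False, False]
assisted = [True, False, True, True]

labels = [classify_switch(i, a, g) for i, a, g in zip(independent, assisted, gold)]
```

On this toy data, accuracy rises from 0.5 without feedback to 0.75 with feedback, and the one remaining error is a case where the user switched to sharing a translation that should not have been shared.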

Paper Structure

This paper contains 60 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 2: During the AI-assisted decision-making step, each treatment group participant is presented with an English source, a Spanish translation, and one of four randomly assigned quality feedback types. For error highlights, we also show a color-coded legend (Minor | Major | Critical). For QA tables, answer texts are displayed in orange when they are identical or highly similar, and in blue otherwise.
  • Figure 3: Average decision accuracy (left) and confidence-weighted accuracy (CWA, right) for each condition. Paired-sample $t$-tests are used to compare independent and AI-assisted performance, and a linear mixed-effects ANOVA with Bonferroni corrections is used to compare treatment conditions. *: significant with $p$ < 0.05; **: $p$ < 0.01; ***: $p$ < 0.001; unmarked: not statistically significant. Detailed results are provided in Appendix \ref{appendix:independent_aiassisted}.
  • Figure 4: Average decision accuracy (left) and CWA (right) for each type of quality feedback and shareability label. $\mathbf{n}$ indicates the number of examples aggregated for each condition and label. The Independent condition aggregates responses made without quality feedback across all conditions. **: significant with $p$ < 0.01; ***: $p$ < 0.001; unmarked: not statistically significant. Detailed results are provided in Appendix \ref{appendix:shareability_label}.
  • Figure 5: Breakdown of switch percentages by quality feedback type, showing appropriate, over-, and under-reliance.
  • Figure 6: Screenshots of the instructions provided to bilingual annotators, along with an example question.
  • ...and 2 more figures
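
The significance markers in the captions for Figures 3 and 4 follow a common convention that can be sketched in a few lines. The example below is a hedged illustration, not the authors' analysis code: `paired_t_statistic`, `bonferroni`, and `stars` are hypothetical helper names, the data are invented, and the linear mixed-effects ANOVA itself is not reproduced. It shows only the paired-sample $t$ statistic, the Bonferroni adjustment (multiply each $p$-value by the number of comparisons, capped at 1), and the mapping from an adjusted $p$-value to the star notation.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Paired-sample t statistic for two equal-length lists of scores.

    t = mean(d) / (stdev(d) / sqrt(n)), where d = after - before per item.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [a - b for a, b in zip(after, before)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def bonferroni(p_values):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def stars(p):
    """Map a p-value to the star notation used in the figure captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # unmarked: not statistically significant


# Invented per-participant correctness (1 = correct) before and after feedback.
t = paired_t_statistic([0, 1, 0, 0], [1, 1, 1, 1])

# Invented raw p-values for four pairwise comparisons.
markers = [stars(p) for p in bonferroni([0.0002, 0.004, 0.03, 0.2])]
```

Note how the adjustment changes the picture: a raw $p$-value of 0.03 would earn one star on its own but is unmarked after correcting for four comparisons.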