Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

Andy Li; Wei Zhou; Rashina Hoda; Chris Bain; Peter Poon

TL;DR

This pilot study compares large language models (GPT-4o, GEMMA-2, LLAMA-3.1) with traditional machine translation tools (Google Translate, Bing Translator, DeepL) for translating English medical consultation summaries into Arabic, Chinese, and Vietnamese. It uses two English summaries, one simple and patient-facing and one complex and clinician-facing, and scores each translation against professional reference translations with BLEU, CHR-F, and METEOR. Traditional MT tools generally outperform LLMs on the complex summary, while LLMs show promise on the simpler summary, particularly for Vietnamese and Chinese. Arabic translations score higher as complexity increases, which the study attributes to the language's morphology. The findings highlight the limitations of current evaluation metrics in capturing clinical relevance and underscore the need for domain-specific training and human oversight to ensure patient safety in medical translation. Overall, the work informs responsible deployment of LLM-based translation in healthcare workflows and guides future improvements in evaluation and language resource development.

Abstract

This study evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Traditional MT tools generally perform better, especially on complex texts, while LLMs show promise on simpler summaries, particularly in Vietnamese and Chinese. Arabic translations improve with complexity, which the study attributes to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.
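To make the idea of reference-based automated scoring concrete, the following is a minimal sketch of a character n-gram F-score in the spirit of CHR-F (commonly written chrF). It is a simplified illustration, not the implementation used in the study: the function names, the whitespace handling, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions chosen for this example, and the sample sentences are invented rather than taken from the study data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1].

    Averages character n-gram precision and recall over the orders
    1..max_n, then combines them into an F-beta score (beta > 1
    weights recall more heavily than precision).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical strings for illustration only.
    reference = "Take one tablet twice daily after meals."
    candidate = "Take one tablet two times a day after meals."
    print(f"score = {simple_chrf(candidate, reference):.3f}")
```

A score of this kind rewards surface overlap with the reference. A translation that changes a dosage but keeps most characters intact can still score highly, which is one way to see why such metrics may miss clinical relevance.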

