DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang
TL;DR
HW-TSC presents an end-to-end document image machine translation (DIMT) system for complex layouts that handles both OCR-based and OCR-free translation within a single framework built on a large vision-language model (LVLM). Training combines multi-task learning (MTL) with perceptual chain-of-thought (PCOT) to jointly model visual layout understanding and cross-lingual content; inference adds minimum Bayes risk (MBR) decoding and post-processing. The approach is evaluated on the DIMT25 dataset using InternVL-based models. The results show that MTL-PCOT supervised fine-tuning (SFT), followed by MBR decoding and post-processing, improves translation quality, and that larger models yield greater gains. The paper documents the data, methods, and results in enough detail to support reproduction and further research on end-to-end document image translation.
Abstract
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayes risk (MBR) decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
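To make the MBR decoding step concrete, the following is a minimal sketch of the general technique, not the authors' implementation. MBR selects, from a pool of candidate translations, the one with the highest average utility against all other candidates. The utility metric here (`unigram_f1`) and the function name `mbr_select` are illustrative assumptions; the abstract does not state which metric or candidate pool the system uses.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Whitespace-token F1 overlap; a stand-in utility, assumed for illustration."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, other) for other in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical candidate pool, e.g. several sampled outputs for one image.
pool = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
chosen = mbr_select(pool)
```

In practice the candidates would come from sampling or beam search over the translation model, and the utility would typically be a translation quality metric rather than raw token overlap.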
