DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang

TL;DR

HW-TSC presents an end-to-end document image machine translation (DIMT) system for complex layouts that handles both OCR-based and OCR-free translation within a single framework built on a large vision-language model (LVLM). Training combines multi-task learning (MTL) with perceptual chain-of-thought (PCOT) to jointly model visual layout understanding and cross-lingual content. At inference time, the system applies minimum Bayes risk (MBR) decoding followed by post-processing. The approach is evaluated on the DIMT25 dataset using InternVL-based models; the results show that MTL-PCOT with supervised fine-tuning (SFT), MBR decoding, and post-processing improves translation quality, and that larger models yield greater gains. The paper documents its data, methods, and results in detail to support reproduction and further research.

Abstract

This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayes risk (MBR) decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
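
The MBR decoding step mentioned above can be illustrated with a minimal sketch. The general idea is to sample several candidate translations and keep the one that agrees most, on average, with the others. The utility function below (`token_f1`, a whitespace-token F1 overlap) and the function name `mbr_select` are assumptions made for illustration; this summary does not state which utility metric or candidate-generation settings the authors use.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Whitespace-token F1 overlap. A stand-in utility, not the paper's metric."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str], utility=token_f1) -> str:
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other candidate as a pseudo-reference for this hypothesis.
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical candidate translations sampled from a model.
candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood near the door",
]
chosen = mbr_select(candidates)
print(chosen)
```

In practice the utility is usually a translation quality metric rather than raw token overlap, and the candidates come from sampling or beam search over the model's outputs; both choices are left open here.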

Paper Structure

This paper contains 16 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Training Data Organization Structure for Our Method.