LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining

Huawen Shen; Gengluo Li; Jinwen Zhong; Yu Zhou

TL;DR

This work tackles language imbalance in Visual Information Extraction by proposing Language Decoupled Pre-training (LDP) and a Language Decoupled Model (LDM). LDP pretrains on language-independent visuals created via AnyText diffusion, enabling the model to generalize across languages using primarily monolingual English data, while LDM extends SAM with MTIM for multi-bounding-box information merging and LKI for language knowledge insertion during fine-tuning. Empirical results on multilingual benchmarks (XFUND, SIBR) show state-of-the-art cross-lingual generalization and competitive monolingual performance, with ablations highlighting the importance of MTIM and LKI. The approach offers a practical path to robust multilingual VIE by decoupling language bias from images and leveraging language-independent pretraining data.

Abstract

Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Because the quantity and quality of pre-training corpora are extremely unbalanced between English and other languages, few works extend to non-English scenarios. In this paper, we conduct systematic experiments to show that the vision and layout modalities are invariant across images in different languages. Once language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, in which the language knowledge is decoupled by a diffusion model, and is then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models while remaining competitive on downstream monolingual/English benchmarks.
