LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining

Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou

TL;DR

This work tackles language imbalance in Visual Information Extraction by proposing Language Decoupled Pre-training (LDP) and a Language Decoupled Model (LDM). LDP pretrains on language-independent visuals created via AnyText diffusion, enabling the model to generalize across languages using primarily monolingual English data, while LDM extends SAM with MTIM for multi-bounding-box information merging and LKI for language knowledge insertion during fine-tuning. Empirical results on multilingual benchmarks (XFUND, SIBR) show state-of-the-art cross-lingual generalization and competitive monolingual performance, with ablations highlighting the importance of MTIM and LKI. The approach offers a practical path to robust multilingual VIE by decoupling language bias from images and leveraging language-independent pretraining data.

Abstract

Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Because the quantity and quality of pre-training corpora are extremely unbalanced between English and other languages, few works extend to non-English scenarios. In this paper, we conduct systematic experiments to show that the vision and layout modalities are invariant across images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, where the language knowledge is decoupled by a diffusion model, and is then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models and remains competitive on downstream monolingual/English benchmarks.

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: (a) Original image. (b) The previous method, LiLT, only decouples the layout modality across different languages, ignoring the vital appearance information. (c) Our method keeps both vision and layout consistent with the original image.
  • Figure 2: The text recognition ratio and language classification accuracy on XFUND. "ori" denotes the original image, in which the language bias has not been decoupled by AnyText.
  • Figure 3: VIE performance on XFUND when applying language-decoupled images.
  • Figure 4: Overall illustration of LDM. LDM takes the image and bounding boxes as input and processes them exactly as SAM does in its preprocessing and encoding. After each SAM decoder layer, MTIM integrates information from the different bounding boxes. A pre-trained, frozen Sentence BERT is applied to augment language knowledge for downstream tasks.
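
The Figure 4 caption says that MTIM integrates information from different bounding boxes after each SAM decoder layer, but this section does not specify how. As a rough illustration only, the sketch below merges per-box feature vectors with plain scaled dot-product self-attention plus a residual connection. This is an assumed stand-in, not the paper's MTIM design; the function name `merge_box_tokens` and the toy two-dimensional features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_box_tokens(tokens):
    """Let every box token attend to all box tokens (itself included).

    tokens: list of equal-length feature vectors, one per bounding box.
    Returns a list of the same shape where each vector is the original
    token plus an attention-weighted average of all tokens.
    NOTE: assumed stand-in for MTIM, not the authors' actual module.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    merged = []
    for query in tokens:
        # Similarity of this box to every box, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        weights = softmax(scores)
        # Weighted average of all box features.
        context = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                   for i in range(dim)]
        # Residual connection keeps the box's own information.
        merged.append([q + c for q, c in zip(query, context)])
    return merged


# Toy example: three boxes with 2-dimensional features.
box_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged = merge_box_tokens(box_tokens)
```

After the call, each entry of `merged` carries a mixture of features from all three boxes, which is the kind of cross-box context the caption describes. A real module would use learned query, key, and value projections and operate on SAM decoder outputs rather than on raw lists.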