Text Change Detection in Multilingual Documents Using Image Comparison

Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee

TL;DR

This work addresses multilingual document comparison by proposing text change detection (TCD) through image-to-image comparison rather than language-dependent OCR. The model uses a Siamese-like encoder–decoder architecture with multi-scale features, correlation marginalization, and cross- and self-attention based on transformers to produce two change maps, $S_{st}$ (source to target) and $S_{ts}$ (target to source), without explicit text alignment. Training on synthetic multilingual data and evaluating on a dedicated text-change dataset, the authors compare the model against semantic segmentation models, change detection baselines, and OCR-based methods; an ablation study confirms the value of the correlation maps and of two-way segmentation. In practice, the method offers language-agnostic change detection suited to contract review and other multilingual document workflows, reaching state-of-the-art results on the segmentation benchmarks and results competitive with OCR-based comparison without language-specific OCR models.

Abstract

Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires selecting an appropriate language model for each document, and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR-based approaches, our method detects changes by comparing text images with each other at the word level. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi-scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach on our benchmark dataset and on two public benchmarks, Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state-of-the-art semantic segmentation and change detection models, as well as against conventional OCR-based models.
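To make the idea of correlating source and target features concrete, the following is a minimal NumPy sketch and not the authors' implementation. It builds a dense 4D correlation between two feature maps and reduces it to one 3D map per direction. The function names (`correlation_4d`, `marginalize_both_ways`), the cosine similarity, and the choice of a max reduction over one spatial axis are all assumptions made for illustration; the abstract does not specify the actual marginalization operator.

```python
import numpy as np


def l2_normalize(feat, eps=1e-8):
    """Normalize a (C, H, W) feature map along the channel axis."""
    norm = np.linalg.norm(feat, axis=0, keepdims=True)
    return feat / (norm + eps)


def correlation_4d(f_s, f_t):
    """Cosine similarity between every source and every target position.

    f_s: (C, Hs, Ws) source features, f_t: (C, Ht, Wt) target features.
    Returns an array of shape (Hs, Ws, Ht, Wt).
    """
    return np.einsum("cij,ckl->ijkl", l2_normalize(f_s), l2_normalize(f_t))


def marginalize_both_ways(corr):
    """Reduce the 4D correlation to one 3D map per direction.

    ASSUMPTION: a max over the other image's row axis stands in for the
    marginalization step; the real operator may differ.
    """
    # Source view: drop target rows -> (Hs, Ws, Wt)
    c_s = corr.max(axis=2)
    # Target view: drop source rows, put target axes first -> (Ht, Wt, Ws)
    c_t = np.moveaxis(corr.max(axis=0), 0, -1)
    return c_s, c_t


# Usage: the two word images may differ in width, so no rescaling is applied.
rng = np.random.default_rng(0)
f_s = rng.standard_normal((8, 4, 6))
f_t = rng.standard_normal((8, 4, 5))
c_s, c_t = marginalize_both_ways(correlation_4d(f_s, f_t))
print(c_s.shape, c_t.shape)
```

In this sketch each source position keeps its best match per target column, and each target position keeps its best match per source column. A decoder could then read low correlation at a position as evidence of a change, in either direction.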

Paper Structure

This paper contains 30 sections, 6 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Text image change detection model architecture. The architecture is composed of four modules, from left to right: encoder, feature attention, correlation marginalization, and decoder.
  • Figure 2: Correlation and marginalization. The marginalization process creates a 3D correlation map ${C_{s}}$ for features from ${F_{s}}$ to ${F_{t}}$.
  • Figure 3: Correlation feature map attention process. Attention is performed between the upsampled correlation maps ${\hat{C}_{s}^{2}}$, ${\hat{C}_{s}^{3}}$, ${\hat{C}_{t}^{2}}$, ${\hat{C}_{t}^{3}}$ and the corresponding top-level feature maps ${F_s^{1}}$, ${F_t^{1}}$.
  • Figure 4: Sample of synthetic training data: (a) source, (b) target, (c) segmentation ground truth from source to target, (d) segmentation ground truth from target to source.
  • Figure 5: Qualitative results of different segmentation models.
  • ...and 6 more figures