OTCR: Optimal Transmission, Compression and Representation for Multimodal Information Extraction

Yang Li, Yajiao Wang, Wenhao Hu, Zhixiong Zhang, Mengting Zhang

TL;DR

OTCR treats Multimodal Information Extraction as text-dominant fusion with controllable visual supplementation. Cross-modal Optimal Transport produces sparse, context-aware alignments between text tokens and visual patches, and a gating mechanism controls how much visual information is injected. A Variational Information Bottleneck then compresses the fused features into minimal, task-relevant representations. On FUNSD and XFUND (ZH), OTCR reports competitive semantic entity recognition (SER) and relation extraction (RE) scores, and ablations confirm that both OT alignment and VIB compression matter. The result is an interpretable, information-theoretic framework for controllable multimodal fusion in document AI.

Abstract

Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. While recent methods have advanced multimodal representation learning, most implicitly assume modality equivalence or treat modalities in a largely uniform manner, still relying on generic fusion paradigms. This often results in indiscriminate incorporation of multimodal signals and insufficient control over task-irrelevant redundancy, which may in turn limit generalization. We revisit MIE from a task-centric view: text should dominate, vision should selectively support. We present OTCR, a two-stage framework. First, Cross-modal Optimal Transport (OT) yields sparse, probabilistic alignments between text tokens and visual patches, with a context-aware gate controlling visual injection. Second, a Variational Information Bottleneck (VIB) compresses fused features, filtering task-irrelevant noise to produce compact, task-adaptive representations. On FUNSD, OTCR achieves 91.95% SER and 91.13% RE, while on XFUND (ZH), it reaches 91.09% SER and 94.20% RE, demonstrating competitive performance across datasets. Feature-level analyses further confirm reduced modality redundancy and strengthened task signals. This work offers an interpretable, information-theoretic paradigm for controllable multimodal fusion in document AI.
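
The first stage described in the abstract (optimal-transport alignment followed by gated visual injection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the entropic Sinkhorn solver, uniform marginals, cosine cost, residual fusion rule, and the names `sinkhorn_plan`, `cosine_cost`, `gated_fusion`, and `w_gate` are all assumptions made for illustration. Text and visual features are assumed to be already projected to a shared dimension.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    Assumption: uniform mass on tokens and patches; the paper may differ.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass assigned to each text token
    b = np.full(m, 1.0 / m)      # mass assigned to each visual patch
    K = np.exp(-cost / eps)      # Gibbs kernel; smaller eps gives a more peaked plan
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows towards marginal a
        v = b / (K.T @ u)        # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]

def cosine_cost(text, vis):
    """Pairwise cost 1 - cosine similarity, shape (n_tokens, n_patches)."""
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    p = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    return 1.0 - t @ p.T

def gated_fusion(text, vis, w_gate, eps=0.1):
    """Text-dominant fusion: text is kept in full, vision is added through a gate."""
    plan = sinkhorn_plan(cosine_cost(text, vis), eps)
    # Row-normalise so each token receives a convex combination of patches.
    weights = plan / plan.sum(axis=1, keepdims=True)
    vis_ctx = weights @ vis                                 # (n_tokens, d)
    # Hypothetical context-aware gate: sigmoid of a linear map on [text; vis_ctx].
    logits = np.concatenate([text, vis_ctx], axis=1) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-logits))                    # (n_tokens, 1)
    return text + gate * vis_ctx, plan, gate

# Toy usage with random features: 5 text tokens, 7 visual patches, dimension 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
vis = rng.normal(size=(7, 8))
w_gate = rng.normal(size=(16, 1))
fused, plan, gate = gated_fusion(text, vis, w_gate)
```

In this sketch the gate scales only the visual term, so a gate near zero leaves the text representation unchanged, which is one simple way to realise "text should dominate, vision should selectively support".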

Paper Structure

This paper contains 15 sections, 14 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The overall framework of OTCR, which integrates Cross-modal Optimal Transport for controllable visual-to-text injection and a Variational Information Bottleneck for redundancy filtering and task-relevant representation learning.
  • Figure 2: t-SNE visualization of the final-layer hidden embeddings on the FUNSD dataset across different models.
  • Figure 3: Ablation results on the FUNSD dataset.
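
The Variational Information Bottleneck named in the Figure 1 caption can be sketched in the same hedged spirit. The code below assumes the common VIB setup with a diagonal Gaussian encoder and a standard normal prior; the source does not state the paper's exact prior, loss weighting, or layer design, so `vib_layer`, `vib_loss`, `w_mu`, `w_logvar`, and `beta` are hypothetical names and choices.

```python
import numpy as np

def vib_layer(h, w_mu, w_logvar, rng):
    """Map fused features h to a stochastic code z and its KL penalty.

    Assumption: diagonal Gaussian posterior and N(0, I) prior.
    """
    mu = h @ w_mu                          # posterior mean, shape (n, k)
    logvar = h @ w_logvar                  # posterior log-variance, shape (n, k)
    noise = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * noise  # reparameterisation trick
    # KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims, averaged over tokens.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1).mean()
    return z, kl

def vib_loss(task_loss, kl, beta=1e-3):
    """Task loss plus beta-weighted compression term (beta is a hypothetical default)."""
    return task_loss + beta * kl

# Toy usage: 5 fused token features of dimension 8 compressed to 4 latent dims.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
w_mu = 0.1 * rng.normal(size=(8, 4))
w_logvar = 0.1 * rng.normal(size=(8, 4))
z, kl = vib_layer(h, w_mu, w_logvar, rng)
total = vib_loss(task_loss=1.0, kl=kl)
```

The KL term is zero only when the posterior equals the prior, so increasing `beta` pushes the code towards discarding information that the task loss does not need.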