Enhancing Multimodal Entity and Relation Extraction with Variational Information Bottleneck

Shiyao Cui; Jiangxia Cao; Xin Cong; Jiawei Sheng; Quangang Li; Tingwen Liu; Jinqiao Shi

TL;DR

This work tackles the challenge of leveraging visual information for multimodal NER (MNER) and multimodal relation extraction (MRE) on social media. It introduces MMIB, a framework that uses variational information bottleneck with a Refinement-Regularizer ($L_{rr}$) to suppress modality-noise and an Alignment-Regularizer ($L_{ar}$) to reduce modality-gap, enabling robust cross-modal representations. Text and image are encoded via BERT and ResNet-based pipelines, then learned with a variational APEnc that yields latent variables $\mathbf{Z}^T$ and $\mathbf{Z}^V$; $L_{rr}$ relies on KL-based MI bounds and a reconstruction term, while $L_{ar}$ maximizes mutual information between paired text-image representations through a discriminator, all optimized together with task losses for MNER and MRE. Experiments on Twitter-2015/2017 (MNER) and MNRE (MRE) demonstrate state-of-the-art results, with ablations confirming the complementary contributions of RR and AR and analyses illustrating noise suppression and alignment improvements. Overall, MMIB provides a principled, information-theoretic approach to cross-modal entity and relation extraction that can extend to broader multimodal tasks.
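The KL-based bound mentioned above can be illustrated with a minimal sketch. It assumes the common variational IB setup in which an encoder outputs the mean and log-variance of a diagonal Gaussian and the prior is a standard normal; the summary does not state MMIB's exact parameterization, so the function names and the choice of prior here are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: diagonal-Gaussian posterior and standard-normal prior,
    the usual choice in variational IB; not confirmed for MMIB.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical 2-dimensional latent for a single token or image region.
rng = random.Random(0)
mu = [1.0, -1.0]
log_var = [0.0, 0.0]

z = reparameterize(mu, log_var, rng)      # stochastic latent sample
kl = kl_to_standard_normal(mu, log_var)   # compression penalty for this latent
print(kl)
```

In a setup like this, the KL value would be scaled by a trade-off coefficient and added to the task loss, so that latents carrying little predictive evidence are pushed toward the prior.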

Abstract

This paper studies multimodal named entity recognition (MNER) and multimodal relation extraction (MRE), which are important for multimedia social platform analysis. The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics, where two issues inherently demand investigation. The first issue is modality-noise: task-irrelevant information in each modality may act as noise that misleads the task prediction. The second issue is modality-gap: representations from different modalities are inconsistent, which prevents building semantic alignment between the text and the image. To address these issues, we propose a novel method for MNER and MRE by Multi-Modal representation learning with Information Bottleneck (MMIB). For the first issue, a refinement-regularizer probes the information-bottleneck principle to balance predictive evidence against noisy information, yielding expressive representations for prediction. For the second issue, an alignment-regularizer is proposed, in which a mutual-information-based term works in a contrastive manner to keep text-image representations consistent. To the best of our knowledge, we are the first to explore variational IB estimation for MNER and MRE. Experiments show that MMIB achieves state-of-the-art performance on three public benchmarks.
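For reference, the information-bottleneck principle that the refinement-regularizer builds on is usually written in the generic form below. Here $X$ is the input, $Y$ the prediction target, $Z$ the learned representation, and $\beta$ a trade-off coefficient; $q(z \mid x)$ and $r(z)$ are a variational posterior and prior. This is the standard textbook formulation, not necessarily the paper's exact notation (the paper's own coefficients are $\beta_1$ and $\beta_2$, whose precise roles are not given in this summary).

```latex
% Generic IB objective: keep what predicts Y, compress away the rest of X.
\max_{Z} \; I(Z; Y) \;-\; \beta \, I(Z; X)

% Standard variational upper bound on the compression term:
I(Z; X) \;\le\; \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( q(z \mid x) \,\|\, r(z) \right) \right]
```

The bound holds for any choice of $r(z)$, which is why a fixed prior can stand in for the intractable marginal over $Z$.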

Paper Structure (40 sections, 27 equations, 6 figures, 5 tables)

Introduction
Preliminary
Task Formulation
MNER
MRE
Mutual Information
Information Bottleneck Principle
Method
Encoding Module
Text Encoding
Image Encoding
Representation Learning Module
Variational Encoding
Refinement-Regularizer (RR)
Alignment-Regularizer (AR)
...and 25 more sections

Figures (6)

Figure 1: Examples from Twitter dataset for MNER and MRE.
Figure 2: Model Architecture.
Figure 3: Case study for modality-noise.
Figure 4: Change in F1 of our model for different values of $\beta_1$ and $\beta_2$ on each dataset: (a) Twitter15; (b) Twitter17; (c) the MRE dataset.
Figure 5: Representation visualization for modality-gap.
...and 1 more figures
