Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification

Tengfei Liu; Yongli Hu; Junbin Gao; Yanfeng Sun; Baocai Yin

TL;DR

Cross-modal long document classification is challenged by hierarchical text structure and noisy image signals. The authors introduce the Hierarchical Multi-modal Transformer (HMT), a dual-transformer architecture that models section-level and sentence-level text features together with the images embedded in the document. A dynamic mask transfer mechanism links the two transformers, and a Dynamic Multi-scale Multi-modal Transformer captures multi-scale sentence–image relations. HMT outperforms both single-modality and existing multi-modal baselines on four datasets, two of which are newly created long-document corpora. The results illustrate the value of explicitly modeling multi-granularity text–image relationships and of passing information between hierarchical levels.

Abstract

Long Document Classification (LDC) has gained significant attention recently. However, the multi-modal data in long documents, such as text and images, is not being used effectively. Prior studies have attempted to integrate text and images in document-related tasks, but they have focused only on short text sequences and page images. Classifying long documents that combine hierarchically structured text with embedded images is a new problem, and it poses difficulties for multi-modal representation. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. HMT performs multi-modal feature interaction and fusion between images and text in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features. Furthermore, we introduce a new interaction strategy, the dynamic mask transfer module, which integrates these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets. The results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.
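
To make the idea of hierarchical text–image interaction concrete, the sketch below shows one way a mask derived at the section level could restrict sentence-level attention over images. It is an illustrative guess, not the paper's formulation: the abstract does not specify how the mask is computed, so the top-k selection rule, the function names (`cross_attention`, `transfer_mask`), the residual fusion step, and the random features are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, mask=None):
    """Scaled dot-product attention weights from text queries to image keys.

    mask: optional boolean array (n_queries, n_keys); True means "may attend".
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    return softmax(scores, axis=-1)

def transfer_mask(section_image_attn, sentence_to_section, top_k=1):
    """Hypothetical mask transfer (an assumption, not the paper's rule):
    each section keeps its top_k most-attended images, and every sentence
    inherits the mask of its parent section."""
    n_sections, n_images = section_image_attn.shape
    top = np.argsort(-section_image_attn, axis=1)[:, :top_k]
    section_mask = np.zeros((n_sections, n_images), dtype=bool)
    np.put_along_axis(section_mask, top, True, axis=1)
    return section_mask[sentence_to_section]

# Toy document: 2 sections, 5 sentences, 3 images, 8-dim random features.
rng = np.random.default_rng(0)
sections = rng.normal(size=(2, 8))
sentences = rng.normal(size=(5, 8))
images = rng.normal(size=(3, 8))
sentence_to_section = np.array([0, 0, 1, 1, 1])  # parent section per sentence

# Coarse level: sections attend to all images.
section_attn = cross_attention(sections, images)
# Transfer: sentences may only attend to images their section selected.
sentence_mask = transfer_mask(section_attn, sentence_to_section, top_k=2)
# Fine level: masked sentence-to-image attention, then a residual fusion.
sentence_attn = cross_attention(sentences, images, mask=sentence_mask)
fused_sentences = sentences + sentence_attn @ images
```

In this toy setup each sentence attends to at most two of the three images, chosen by its parent section, which is one simple way to keep unrelated images from adding noise at the finer level.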

Paper Structure (27 sections, 20 equations, 9 figures, 6 tables)

Introduction
Related Work
Long Document Classification
Document Image Classification
Multi-modal Transformer
Proposed Method
Feature Extraction
Textual Features
Visual Features
Hierarchical Multi-modal Transformer
Multi-modal Transformer
Dynamic Multi-scale Multi-modal Transformer
Dynamic Mask Transfer
Model Training
Experiments
...and 12 more sections

Figures (9)

Figure 1: Example of multi-modal long document from the dataset of MAAPD.
Figure 2: Diagram of the hierarchical text structure of long documents, and their corresponding relations with the paired images.
Figure 3: The framework of the proposed HMT model for cross-modal long document classification.
Figure 4: An illustration of the proposed Sentence Token Generation (STG) block.
Figure 5: Schematic diagram of the Dynamic Multi-scale Multi-modal Transformer (DMMT).
...and 4 more figures
