Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

TL;DR

Cross-modal long document classification must cope with hierarchically structured text and noisy image signals. The authors introduce the Hierarchical Multi-modal Transformer (HMT), which pairs two transformers: a multi-modal transformer that relates section-level text features to the images embedded in the document, and a Dynamic Multi-scale Multi-modal Transformer (DMMT) that captures multi-scale sentence–image relations. A dynamic mask transfer module links the two by propagating features between them. In experiments on four datasets, two of them newly created long-document corpora, HMT outperforms state-of-the-art single-modality and multi-modal baselines. The results illustrate the value of modeling text–image relationships at multiple granularities and of passing information between hierarchy levels.

Abstract

Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents, such as text and images, is not being effectively utilized. Prior studies in this area have attempted to integrate text and images in document-related tasks, but they have focused only on short text sequences and images of pages. How to classify long documents with hierarchically structured text and embedded images is a new problem that faces multi-modal representation difficulties. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and text in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features. Furthermore, we introduce a new interaction strategy called the dynamic mask transfer module to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets, and the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.

Paper Structure (27 sections, 20 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: An example of a multi-modal long document from the MAAPD dataset.
  • Figure 2: Diagram of the hierarchical text structure of long documents, and their corresponding relations with the paired images.
  • Figure 3: The framework of the proposed HMT model for cross-modal long document classification.
  • Figure 4: An illustration of the proposed Sentence Token Generation (STG) block.
  • Figure 5: Schematic diagram of the Dynamic Multi-scale Multi-modal Transformer (DMMT).
  • ...and 4 more figures