Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification
Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin
TL;DR
Cross-modal long document classification is difficult because the text is hierarchically structured and the image signals are noisy. The authors introduce the Hierarchical Multi-modal Transformer (HMT), a dual-transformer architecture that models interactions between embedded images and both section-level and sentence-level text features. A Dynamic Multi-scale Multi-modal Transformer captures multi-scale sentence–image relations, and a dynamic mask transfer mechanism propagates information between the two transformers. On four datasets, including two newly created long-document corpora, HMT outperforms both single-modality and existing multi-modal baselines. The results indicate the value of explicitly modeling multi-granularity text–image relationships and of passing information between hierarchical levels.
Abstract
Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents, such as text and images, is not effectively utilized. Prior studies have attempted to integrate text and images in document-related tasks, but they have focused only on short text sequences and page images. How to classify long documents that contain hierarchically structured text and embedded images is a new problem, and it raises difficulties in multi-modal representation. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and text in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features. Furthermore, we introduce a new interaction strategy, the dynamic mask transfer module, which integrates the two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets. The results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.
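
The abstract does not spell out how the two transformers exchange information, so the following is only a toy sketch of one way that section-level attention over images could constrain sentence-level attention. Everything in it is an assumption made for illustration: the helper names (softmax, cross_attention, transfer_mask), the top-k rule controlled by the hypothetical keep parameter, and the two-dimensional toy vectors. It is not the authors' implementation of the dynamic mask transfer module.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, mask=None):
    # Scaled dot-product attention weights of each text query over image keys.
    # mask[i][j] == False blocks query i from attending to key j.
    scale = math.sqrt(len(keys[0]))
    weights = []
    for i, q in enumerate(queries):
        scores = [dot(q, k) / scale for k in keys]
        if mask is not None:
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights

def transfer_mask(section_weights, sentence_to_section, keep=1):
    # ASSUMPTION (hypothetical rule): each sentence may attend only to the
    # top-`keep` images selected by its parent section's attention weights.
    mask = []
    for sec in sentence_to_section:
        w = section_weights[sec]
        top = sorted(range(len(w)), key=lambda j: w[j], reverse=True)[:keep]
        mask.append([j in top for j in range(len(w))])
    return mask

# Toy features: 2 sections, 2 images, 3 sentences (all values made up).
sections = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sentence_to_section = [0, 0, 1]  # parent section index of each sentence

# Coarse level: sections attend to images.
section_weights = cross_attention(sections, images)
# Transfer: turn the section-level attention into a sentence-level mask.
mask = transfer_mask(section_weights, sentence_to_section, keep=1)
# Fine level: sentences attend only to the images their section kept.
sentence_weights = cross_attention(sentences, images, mask=mask)
```

In this toy setup the first two sentences can attend only to the first image and the third sentence only to the second, which mirrors the general idea of the coarse level filtering noisy image signals before the fine level attends to them.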
