MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification

Yang Qiao, Xiaoyu Zhong, Xiaofeng Gu, Zhiguo Yu

TL;DR

MCFNet tackles fine-grained semantic classification by jointly exploiting visual and textual information through a regularized integrated fusion module and a hybrid attention mechanism that enables bidirectional cross-modal alignment. The framework comprises three parts: a text encoder (ALBERT) and an image encoder (ViT); a regularized fusion stage that combines dropout and ElasticNet regularization with a self-/cross-attention scheme; and a multimodal decision classifier trained with a multi-loss objective and learnable modality weights. Key contributions are the regularized feature representations, the efficient hybrid attention network, and the weighted voting strategy, which mitigates information loss from any single modality. On the Con-Text and Drink Bottle datasets the authors report state-of-the-art performance, and ablation studies show that each module contributes to the result, which they take as evidence of generalization and practical viability for multimodal fine-grained tasks.
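
The ElasticNet term mentioned above combines an L1 and an L2 penalty on a set of weights. The sketch below only illustrates that general form. The function name, the coefficient names `l1` and `l2`, and their default values are assumptions for illustration, and the summary does not say which parameters or which modality the penalty is applied to.

```python
def elastic_net_penalty(weights, l1=1e-4, l2=1e-4):
    """Generic ElasticNet penalty: l1 * sum|w| + l2 * sum(w^2).

    `weights` is a flat list of floats. The coefficients are
    hypothetical defaults, not values taken from the paper.
    """
    l1_term = sum(abs(w) for w in weights)   # encourages sparsity
    l2_term = sum(w * w for w in weights)    # discourages large weights
    return l1 * l1_term + l2 * l2_term


# The penalty is added to the task loss during training.
penalty = elastic_net_penalty([1.0, -2.0], l1=0.5, l2=0.25)
```

In a training loop this value would be added to the classification loss, with dropout applied separately inside the network.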

Abstract

Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.
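
The weighted voting paradigm can be sketched as follows, assuming three branches (text, interaction, image) that each produce class logits. Two details here are assumptions rather than the authors' stated method: normalizing the learnable coefficients with a softmax, and summing the per-branch cross-entropy losses with the fused loss at equal weight. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_vote(branch_logits, raw_weights):
    """Fuse per-branch class probabilities with normalized weights.

    branch_logits: one list of class logits per branch.
    raw_weights:   one unconstrained (learnable) scalar per branch.
    Returns the fused distribution and the per-branch distributions.
    """
    weights = softmax(raw_weights)              # assumption: softmax-normalized
    probs = [softmax(logits) for logits in branch_logits]
    n_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]
    return fused, probs


def multi_loss(branch_logits, raw_weights, label):
    """Cross-entropy on the fused vote plus one term per branch."""
    fused, probs = weighted_vote(branch_logits, raw_weights)
    fused_loss = -math.log(fused[label])
    branch_losses = [-math.log(p[label]) for p in probs]
    return fused_loss + sum(branch_losses)      # assumption: unweighted sum


# Text, interaction, and image branches voting over two classes.
logits = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
fused, _ = weighted_vote(logits, raw_weights=[0.0, 0.0, 0.0])
```

Because the fused output is a convex combination of the branch distributions, it remains a valid probability distribution, and a branch with a weak signal can be down-weighted instead of discarded.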

Paper Structure

This paper contains 29 sections, 18 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of image classification results. (a) Classification using image data alone yields lower accuracy. (b) Classification using multimodal data significantly improves accuracy.
  • Figure 2: Detailed network architecture. The network employs the pre-trained models ALBERT and ViT to extract textual and visual information, respectively. A multimodal regularization integration fusion network then fuses the features of scene text regions with those of visually salient objects. The multimodal decision classification network uses multiple loss functions and a dynamic weight adjustment mechanism to achieve precise fine-grained classification.
  • Figure 3: A comparison diagram of our multimodal hybrid attention network with two other popular interaction modules. (a) Interaction encoder. Text and image features are first fused by a cross-attention module, and then fed into a self-attention module. (b) Merged attention. Text and image features are first concatenated and then fed into their respective independent self-attention modules. (c) Our hybrid attention network. Text and image features are first fed into separate self-attention modules, and then sent to a cross-attention module for cross-modal interaction.
  • Figure 4: Overview of the multimodal decision classification network. Integrated features $O_T$, $O_H$, and $O_I$ are fed into corresponding text, interaction, and image branches. Each branch contains convolution, pooling, and fully connected layers followed by Softmax. Final predictions are fused via weighted voting with learnable coefficients, and the overall loss is computed accordingly.
  • Figure 5: PR curves of different ablation methods.
  • ...and 2 more figures
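
The ordering described for the hybrid attention network in Figure 3(c), self-attention within each modality followed by cross-attention between modalities, can be sketched as below. This is a minimal illustration under simplifying assumptions: a single head, no learned query/key/value projections, and no residual connections or normalization. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of token vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output token is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def hybrid_attention(text_tokens, image_tokens):
    """Self-attention per modality, then bidirectional cross-attention."""
    # Step 1: each modality attends to itself.
    text_sa = attention(text_tokens, text_tokens, text_tokens)
    image_sa = attention(image_tokens, image_tokens, image_tokens)
    # Step 2: text queries image features, and image queries text features.
    text_out = attention(text_sa, image_sa, image_sa)
    image_out = attention(image_sa, text_sa, text_sa)
    return text_out, image_out


# Two text tokens and three image patches, each a 2-dimensional vector.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
text_out, image_out = hybrid_attention(text, image)
```

Each output keeps the token count of its query modality, so `text_out` has one vector per text token and `image_out` one per image patch. In contrast, the interaction encoder of Figure 3(a) applies the same two operations in the opposite order.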