Multimodal Sentiment Analysis Based on BERT and ResNet
JiaLe Ren
TL;DR
The paper tackles multimodal sentiment analysis by fusing text and image data using BERT for textual representation and ResNet50 for visual features. It introduces five fusion strategies (CMACModel, HSTECModel, OTEModel, NativeCatModel, NativeCombineModel), with attention-based approaches shown to better exploit cross-modal information. On the MAVA-single dataset, the OTEModel delivers the best performance, outperforming single-modal baselines by about 4–8 percentage points and surpassing simple fusion methods, demonstrating effective cross-modal integration. The work highlights the practical potential of transformer- and residual-based fusion for robust multimodal sentiment analysis and outlines future directions toward more advanced fusion techniques and generalization.
Abstract
With the rapid development of the Internet and social media, multimodal data (text and images) has become increasingly important in sentiment analysis. However, existing methods struggle to fuse text and image features effectively, which limits analysis accuracy. To address this problem, this paper proposes a multimodal sentiment analysis framework that combines BERT and ResNet. BERT has shown strong text representation ability in natural language processing, and ResNet offers excellent image feature extraction in computer vision. First, BERT extracts the text feature vectors and ResNet extracts the image feature representations. Several feature fusion strategies are then explored, and a fusion model based on the attention mechanism is ultimately selected to make full use of the complementary information between text and image. Experimental results on the public dataset MAVA-single show that, compared with single-modal models that use only BERT or only ResNet, the proposed multimodal model improves both accuracy and F1 score, reaching a best accuracy of 74.5%. This study provides new ideas and methods for multimodal sentiment analysis and demonstrates the application potential of BERT and ResNet in cross-domain fusion. In the future, more advanced feature fusion techniques and optimization strategies will be explored to further improve the accuracy and generalization ability of multimodal sentiment analysis.
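
The attention-based fusion step described in the abstract can be illustrated with a minimal sketch. It is not the paper's OTEModel or any of its other named models. It assumes, for illustration only, that an image vector acts as a query over text token vectors and that the attended text summary is concatenated with the image vector. The function names (attention_fuse, softmax, dot) and the 4-dimensional toy vectors are hypothetical. In a real pipeline the token vectors would come from BERT and the image vector from ResNet50, both passed through learned projections to a shared dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(text_tokens, image_vec):
    """Scaled dot-product attention with the image vector as the query.

    text_tokens: list of token vectors (stand-ins for BERT outputs).
    image_vec:   one vector (stand-in for a projected ResNet feature).
    Returns (fused_vector, attention_weights), where fused_vector is the
    image vector concatenated with the attention-weighted text summary.
    """
    d = len(image_vec)
    scores = [dot(image_vec, tok) / math.sqrt(d) for tok in text_tokens]
    weights = softmax(scores)
    attended = [
        sum(w * tok[i] for w, tok in zip(weights, text_tokens))
        for i in range(d)
    ]
    return image_vec + attended, weights


# Toy inputs (assumed values, not taken from the paper).
text_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
image_vec = [2.0, 0.0, 0.0, 0.0]

fused, weights = attention_fuse(text_tokens, image_vec)
print(len(fused), weights)
```

In this sketch the first text token aligns with the image vector, so it receives the largest attention weight. A classifier head would then map the fused vector to sentiment labels.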
