VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos

Weihao Zhong, Yinhao Xiao, Minghui Xu, Xiuzhen Cheng

TL;DR

A multimodal fake news detection method for short videos. It analyzes video content at multiple levels, converts the representations from each modality into a unified textual description, and feeds that description into a large language model for comprehensive evaluation.

Abstract

Short video platforms have become important channels for news dissemination, offering a highly engaging and immediate way for users to access current events and share information. However, these platforms have also emerged as significant conduits for the rapid spread of misinformation, as fake news and rumors can leverage the visual appeal and wide reach of short videos to circulate extensively among audiences. Existing fake news detection methods mainly rely on single-modal information, such as text or images, or apply only basic fusion techniques, limiting their ability to handle the complex, multi-layered information inherent in short videos. To address these limitations, this paper presents a novel fake news detection method based on multimodal information, designed to identify misinformation through a multi-level analysis of video content. This approach effectively utilizes different modal representations to generate a unified textual description, which is then fed into a large language model for comprehensive evaluation. The proposed framework successfully integrates multimodal features within videos, significantly enhancing the accuracy and reliability of fake news detection. Experimental results demonstrate that the proposed approach outperforms existing models in terms of accuracy, robustness, and utilization of multimodal information, achieving an accuracy of 90.93%, which is significantly higher than the best baseline model (SV-FEND) at 81.05%. Furthermore, case studies provide additional evidence of the effectiveness of the approach in accurately distinguishing between fake news, debunking content, and real incidents, highlighting its reliability and robustness in real-world applications.
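To make the reported gap concrete, the short sketch below restates the two accuracy figures from the abstract as an absolute and a relative improvement. Only the two input numbers come from the abstract; the variable names are illustrative.

```python
# Accuracy figures quoted in the abstract (percent).
vmid_accuracy = 90.93      # VMID
sv_fend_accuracy = 81.05   # best baseline, SV-FEND

# Absolute gain is measured in percentage points.
absolute_gain = vmid_accuracy - sv_fend_accuracy
# Relative gain is expressed as a percentage of the baseline accuracy.
relative_gain = absolute_gain / sv_fend_accuracy * 100

print(f"Absolute gain: {absolute_gain:.2f} percentage points")  # 9.88
print(f"Relative gain: {relative_gain:.2f}%")                   # 12.19%
```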

Paper Structure

This paper contains 37 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Architecture of the proposed framework VMID. The proposed approach integrates multiple modalities to generate text conclusions from short videos. It consists of three main components: CogVLM2, VSE, and Whisper. CogVLM2 processes keyframes extracted from the video, while VSE summarizes the video content. Whisper transcribes the audio component. These outputs are combined into an integrated prompt, which is fed into the LLMs. The LLMs employ cross-modal attention fusion at both medium and high levels to integrate information from text, video summaries, and audio inputs. Hierarchical fusion further combines these representations before passing them through a classification head and output layer to produce the final text conclusion.
  • Figure 2: Overview of the Text Processing Workflow. The process begins with the KeyFrame Extractor, which captures critical moments from the video where subtitles appear. These frames are processed by the Text Detection module, featuring the PFHead structure for initial adjustments, the PP-LCNetV3 backbone for feature extraction, a Detection Boxes Rectify module to localize subtitle areas, and a Text Recognition (SVTR) module to extract text. The final output provides both the extracted subtitle text and its corresponding timestamps, ensuring temporal accuracy.
  • Figure 3: Short video detection examples, with background colors indicating the nature of the video content: yellow for content aimed at debunking rumors, gray for fake content, and blue for genuine content. The "Video Summary" column in the table presents the content following image processing. Information that contributes to the inference process is highlighted in blue text.
  • Figure 4: Loss curves for fine-tuning with LoRA on various models: Baichuan, GLM4, InternLM, and Qwen2.5. The curves illustrate the training loss over time, demonstrating the convergence patterns of each model during the fine-tuning process.
  • Figure 5: Two representative fake news videos from FakeSV, one detected and the other missed by VMID, showcasing its performance in fake news detection.
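
The Figure 1 caption describes the outputs of CogVLM2, VSE, and Whisper being combined into an integrated prompt for the LLM. The following is a minimal sketch of that combining step only, assuming each component has already produced plain text. The class, field names, section labels, and question wording are all hypothetical and are not taken from the paper; the paper's actual prompt template may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoEvidence:
    """Text produced by each upstream component (all field names are hypothetical)."""
    keyframe_descriptions: List[str]  # e.g. one CogVLM2 description per keyframe
    vse_output: str                   # text produced by the VSE component
    audio_transcript: str             # e.g. a Whisper transcription


def build_integrated_prompt(evidence: VideoEvidence) -> str:
    """Merge the per-modality text into a single prompt string."""
    # Number the keyframe descriptions so the LLM can refer to them by position.
    frames = "\n".join(
        f"  Frame {i}: {desc.strip()}"
        for i, desc in enumerate(evidence.keyframe_descriptions, start=1)
    ) or "  (none)"
    sections = [
        ("Keyframe descriptions", frames),
        ("VSE output", "  " + (evidence.vse_output.strip() or "(none)")),
        ("Audio transcript", "  " + (evidence.audio_transcript.strip() or "(none)")),
    ]
    body = "\n".join(f"[{title}]\n{text}" for title, text in sections)
    # Assumed question; the three labels mirror the categories named in the abstract.
    question = (
        "Based on the evidence above, is this video fake news, "
        "debunking content, or a real incident?"
    )
    return f"{body}\n\n{question}"


example = VideoEvidence(
    keyframe_descriptions=["A reporter stands in a flooded street.", "A close-up of a road sign."],
    vse_output="Heavy rain floods the city center.",
    audio_transcript="",
)
prompt = build_integrated_prompt(example)
print(prompt)
```

Keeping each modality in its own labeled section, with an explicit placeholder when a modality is empty, lets the downstream model see which source a statement came from and which sources were missing.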