Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Mengying Ge, Dongkai Tang, Mingyang Li

TL;DR

The paper addresses the limitation of fixed-label emotion datasets by proposing an open-vocabulary emotion recognition framework for videos built on multimodal large language models. It integrates three components: (1) fine-tuning InternVL on a human-centric caption dataset via Swift+LoRA, (2) applying the AffectGPT trimodal framework to generate rich open-vocabulary emotion descriptions, and (3) a multi-model co-judgment strategy that fuses predictions to improve recall. A data-augmentation pipeline generates 26,000 high-quality captions from CH-SIMS, and experiments on MER2024-OV demonstrate improvements in open-vocabulary emotion understanding, along with an analysis of how input modalities and model fusion affect the results. The work highlights the potential of combining discriminative multimodal models with LLM-based reasoning to achieve richer emotion representations, at the cost of a trade-off between recall and precision when multiple models are integrated.
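
The recall/precision trade-off of the co-judgment step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that fusion is a plain union of each model's predicted label set and that labels are compared by exact string match after lowercasing; the summary does not specify the paper's fusion rule, and the MER2024-OV metric may differ. The names `fuse_labels` and `set_precision_recall`, and all labels below, are hypothetical.

```python
def fuse_labels(predictions):
    """Union the label lists predicted by several models (assumed fusion rule)."""
    fused = set()
    for labels in predictions:
        # Normalize case and whitespace so "Happy" and "happy" count once.
        fused |= {label.strip().lower() for label in labels}
    return fused


def set_precision_recall(predicted, reference):
    """Set-level precision and recall of predicted labels against a reference set."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Made-up outputs from two models for one clip, plus a made-up reference set.
model_a = ["happy"]
model_b = ["Happy", "surprised", "nervous"]
reference = {"happy", "surprised"}

single = set_precision_recall(fuse_labels([model_a]), reference)
fused = set_precision_recall(fuse_labels([model_a, model_b]), reference)

print("single model (precision, recall):", single)
print("fused models (precision, recall):", fused)
```

In this toy case the union recovers a reference label that the first model missed, so recall rises, while the extra unmatched label lowers precision. That is the kind of trade-off the summary attributes to integrating multiple models.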

Abstract

Multimodal emotion recognition is a task of considerable research interest. However, traditional datasets rely on fixed labels, so models tend to focus on the dominant emotion and overlook subtle emotional changes in complex scenes. This report presents a solution that uses multimodal large language models (MLLMs) to generate open-vocabulary emotion labels from video. The solution covers the framework, data generation and processing, training methods, result generation, and multi-model co-judgment. In the MER-OV (Open-Vocabulary Emotion Recognition) track of the MER2024 challenge, our method achieved significant advantages, demonstrating its strong capabilities in complex emotion computation.

Paper Structure (13 sections, 3 tables)