Hallucination Mitigation Prompts Long-term Video Understanding

Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie

TL;DR

This paper tackles hallucinations in long-video understanding by proposing a holistic mitigation pipeline that combines CLIP-based frame sampling, a question-guided visual feature extractor, and generation-control strategies (Chain-of-Thought and In-Context Learning). It extends TimeChat with an image-understanding path for breakpoint queries and introduces a CLIP-based comparison mechanism to fuse image- and video-derived answers. On the MovieChat dataset, the approach achieves 84.2% in global mode and 62.9% in breakpoint mode, outperforming the official baseline and earning third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The work offers a practical framework to reduce incorrect references and fabrication in long-video QA, with code released for reproducibility and extension.

Abstract

Recently, multimodal large language models (MLLMs) have made significant advances in video understanding tasks. However, their ability to understand unprocessed long videos remains very limited, primarily because of the enormous memory overhead involved. Although existing methods balance memory and information by aggregating frames, they inevitably introduce severe hallucination. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, we inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers. Notably, for the breakpoint mode, we found that image understanding models achieved better results than video understanding models. Therefore, we aggregated the answers from both types of models using a comparison mechanism. Ultimately, we achieved 84.2% and 62.9% for the global and breakpoint modes, respectively, on the MovieChat dataset, surpassing the official baseline model by 29.1% and 24.1%. Moreover, the proposed method won third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is available at https://github.com/lntzm/CVPR24Track-LongVideo
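
The question-guided frame sampling step can be illustrated with a minimal sketch. It assumes that frame and question embeddings have already been produced by a CLIP image and text encoder (not shown here), and it ranks frames by cosine similarity as a stand-in for the CLIP Score. The function names `cosine_similarity` and `select_key_frames`, and the top-k selection rule, are hypothetical and are not taken from the authors' released code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def select_key_frames(
    frame_embeddings: Sequence[Sequence[float]],
    question_embedding: Sequence[float],
    num_frames: int,
) -> List[int]:
    """Return indices of the frames most similar to the question.

    Assumption: embeddings are precomputed CLIP features. The returned
    indices are sorted so the selected frames keep their temporal order.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    scores = [cosine_similarity(f, question_embedding) for f in frame_embeddings]
    # Rank frame indices from most to least relevant to the question.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top-scoring frames, then restore chronological order.
    return sorted(ranked[:num_frames])
```

Returning the indices in chronological order is a design choice of this sketch: a video model that consumes the selected frames usually expects them in temporal sequence rather than in relevance order.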

Paper Structure (13 sections, 2 equations, 1 figure, 5 tables)

Figures (1)

  • Figure 1: The overall pipeline of our methodology. We introduce both training and inference techniques to obtain better results. During training, we make the visual content relevant to the instruction. During inference, we use both chain-of-thought (CoT) and in-context learning (ICL) for enhancement.
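
To make the inference-time combination of CoT and ICL concrete, the following is a minimal sketch of how such a prompt might be assembled. The prompt wording, the `build_prompt` function, and the (question, reasoning, answer) example format are all assumptions made for illustration; the actual prompts used by the authors are not given in this section.

```python
from typing import List, Tuple


def build_prompt(question: str, examples: List[Tuple[str, str, str]]) -> str:
    """Assemble a few-shot prompt that asks for step-by-step reasoning.

    Each example is a hypothetical (question, reasoning, answer) triple.
    The worked examples supply the in-context learning signal; the final
    "think step by step" cue is a common chain-of-thought trigger.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(
            f"Question: {ex_question}\n"
            f"Reasoning: {ex_reasoning}\n"
            f"Answer: {ex_answer}"
        )
    # The target question is left open so the model continues the reasoning.
    parts.append(f"Question: {question}\nReasoning: Let's think step by step.")
    return "\n\n".join(parts)
```

In a pipeline like the one in Figure 1, the resulting string would be passed to the language model together with the visual features; that call is omitted here because it depends on the specific model interface.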