M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation

Hongcheng Liu; Pingjie Wang; Yu Wang; Yanfeng Wang

TL;DR

The paper tackles hallucination in video-grounded dialogue generation (VDG) by analyzing how multimodal knowledge anchors influence outputs and showing that hallucination patterns vary across models. It proposes M2K-VDG, a model-adaptive two-stage framework that first detects multimodal knowledge anchor tokens using perplexity-based and counterfactual-effect-based methods, then reduces hallucinations by normalizing these anchors and injecting them into training via an anchor-enhanced loss. A counterfactual formulation is used to robustly identify anchor tokens, with $P(Y|K,Q) = \prod_{t=1}^{T} P(y_t|K,Q,Y_{<t})$ serving as the generation objective and $W_{CF} = | \log P_1(y_t|Q,Y_{<t}) - \log P_2(y_t|K,Q,Y_{<t}) |$ guiding anchor detection. Experiments on AVSD10, NExT-OE, and MUSIC-AVQA show consistent improvements over state-of-the-art baselines, demonstrating strong reductions in hallucinations and better grounding in multimodal knowledge across diverse VDG tasks.
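The two formulas in the summary can be made concrete with a small sketch. It assumes the per-token probabilities from a question-only pass ($P_1$) and a knowledge-conditioned pass ($P_2$) are already available; the function names, the example tokens, and all probability values are hypothetical and are not taken from the paper.

```python
import math

def sequence_log_likelihood(token_probs):
    """log P(Y|K,Q) = sum over t of log P(y_t | K, Q, Y_<t)."""
    return sum(math.log(p) for p in token_probs)

def counterfactual_weights(p_without_knowledge, p_with_knowledge):
    """Per-token W_CF = |log P1(y_t|Q,Y_<t) - log P2(y_t|K,Q,Y_<t)|."""
    if len(p_without_knowledge) != len(p_with_knowledge):
        raise ValueError("both inputs must cover the same answer tokens")
    return [abs(math.log(p1) - math.log(p2))
            for p1, p2 in zip(p_without_knowledge, p_with_knowledge)]

# Hypothetical answer tokens and probabilities (illustrative only).
tokens = ["yes", "he", "opens", "fridge"]
p1 = [0.10, 0.60, 0.30, 0.02]  # assumed question-only probabilities
p2 = [0.70, 0.62, 0.35, 0.50]  # assumed knowledge-conditioned probabilities

weights = counterfactual_weights(p1, p2)
# The token whose probability depends most on the multimodal knowledge.
anchor = tokens[max(range(len(tokens)), key=weights.__getitem__)]
```

Under these assumed numbers, tokens whose probability barely changes when the knowledge $K$ is removed receive a weight near zero, while tokens that rely on $K$ stand out as anchor candidates.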

Abstract

Video-grounded dialogue generation (VDG) requires the system to generate a fluent and accurate answer based on multimodal knowledge. In practice, however, the difficulty of utilizing multimodal knowledge causes serious hallucination in VDG models. Although previous works mitigate hallucination in a variety of ways, they largely overlook the importance of the multimodal knowledge anchor answer tokens. In this paper, we reveal via perplexity that different VDG models experience varying hallucinations and exhibit diverse anchor tokens. Based on this observation, we propose M2K-VDG, a model-adaptive multimodal knowledge anchor enhancement framework for hallucination reduction. Furthermore, we introduce the counterfactual effect for more accurate anchor token detection. Experimental results on three popular benchmarks show the superiority of our approach over state-of-the-art methods, demonstrating its effectiveness in reducing hallucinations.
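The perplexity observation can be illustrated with a short sketch. The exact per-token definition used in the paper is not given in this excerpt, so the code assumes per-token perplexity is the exponential of the token's negative log-likelihood; the models, tokens, and probabilities are hypothetical.

```python
import math

def token_perplexity(token_probs):
    """Per-token perplexity, assumed here to be exp(-log p) for each token."""
    return [math.exp(-math.log(p)) for p in token_probs]

def highest_perplexity_token(tokens, token_probs):
    """Return the answer token on which a model is least confident."""
    ppl = token_perplexity(token_probs)
    return tokens[max(range(len(tokens)), key=ppl.__getitem__)]

# Hypothetical ground-truth answer and per-token probabilities from two models.
tokens = ["he", "walks", "to", "the", "fridge"]
model_a_probs = [0.80, 0.20, 0.70, 0.90, 0.40]  # assumed values
model_b_probs = [0.75, 0.60, 0.65, 0.85, 0.10]  # assumed values

anchor_a = highest_perplexity_token(tokens, model_a_probs)
anchor_b = highest_perplexity_token(tokens, model_b_probs)
```

With these assumed values the two models peak on different tokens, which is the kind of model-specific behaviour that motivates detecting anchors separately for each model.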

Paper Structure (30 sections, 12 equations, 6 figures, 6 tables)


Introduction
Related Works
Preliminaries
Causal Graph
Counterfactual notations
Counterfactual Effect
Methods
Task Formulation
Overview
Anchor Token Detection
Perplexity-based Detection
Counterfactual Effect-based Detection
Model Hallucination Reduction
Weight Normalization
Anchor Enhanced Loss
...and 15 more sections

Figures (6)

Figure 1: An illustration of the video-grounded dialogue generation task, in which the system must generate a fluent and accurate response according to the multimodal knowledge. The difficulty of utilizing multimodal knowledge, however, tends to lead to hallucination.
Figure 2: The perplexity of answer tokens under different VDG models trained on the same multimodal content; higher perplexity indicates more serious hallucination. The question is 'tell me the sequence in which the event occurred?', and the red labels mark the knowledge-related tokens of the ground-truth answer as identified by human annotation. Different models hallucinate on different anchor tokens, which reflects that multimodal knowledge anchors vary across models.
Figure 3: The illustration of the causal effect.
Figure 4: The overview of M2K-VDG. (a) Anchor Token Detection uses well-trained models to detect the multimodal knowledge anchor tokens in the grounded answer, with either a perplexity-based or a counterfactual effect-based method. (b) Model Hallucination Reduction normalizes the anchor weight and guides the training of the new model through a temperature change derived from the normalized weight.
Figure 5: The anchor degree of each token under the two detection methods. The question is 'is it the last you see of him when he walks out of frame ?' and the red labels mark the knowledge-related tokens identified by human annotation. The perplexity-based technique selects 'him', 'toward', and 'fridge' as the anchor tokens, whereas the correct set is 'yes' and 'fridge'. By contrast, the counterfactual effect-based method detects them accurately, which demonstrates its effectiveness and robustness.
...and 1 more figure
