Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

Tinghui Zhu; Qin Liu; Fei Wang; Zhengzhong Tu; Muhao Chen

TL;DR

This paper formally defines the problem of $\textbf{cross-modality parametric knowledge conflict}$, presents a systematic approach to detect, interpret, and mitigate such conflicts, and introduces a pipeline that identifies conflicts between visual and textual answers.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies between the knowledge represented in their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to distinguish conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that, based on answer confidence, removes undesirable logits inferred from the less confident modality components. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves average accuracy by 2.24%.
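To make the idea of confidence-based contrastive decoding more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each modality branch yields one logit vector over the same candidate answers, that answer confidence can be approximated by the highest softmax probability, and that the contrast takes the generic form (1 + alpha) * strong - alpha * weak. The function names, the `alpha` parameter, and the tie-breaking rule are hypothetical choices made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_confidence(logits):
    """Assumed confidence proxy: the highest softmax probability."""
    return max(softmax(logits))


def dynamic_contrastive_logits(visual_logits, textual_logits, alpha=1.0):
    """Contrast the more confident branch against the less confident one.

    Assumption: uses the generic contrastive form
    (1 + alpha) * strong - alpha * weak, which is not necessarily
    the exact formula used in the paper.
    Returns (trusted_branch_name, adjusted_logits).
    """
    if len(visual_logits) != len(textual_logits):
        raise ValueError("both branches must score the same candidates")
    # Ties go to the visual branch; this choice is arbitrary.
    if answer_confidence(visual_logits) >= answer_confidence(textual_logits):
        trusted, strong, weak = "visual", visual_logits, textual_logits
    else:
        trusted, strong, weak = "textual", textual_logits, visual_logits
    adjusted = [(1 + alpha) * s - alpha * w for s, w in zip(strong, weak)]
    return trusted, adjusted


# Toy logits over three candidate answers (made-up numbers).
visual = [2.0, 0.5, 0.1]
textual = [0.2, 0.9, 0.3]
trusted, adjusted = dynamic_contrastive_logits(visual, textual)
best = max(range(len(adjusted)), key=adjusted.__getitem__)
print(trusted, best)
```

In this toy case the visual branch is more peaked, so it is treated as the trusted branch and the textual logits are subtracted from it. Which branch is trusted is decided per sample, which is the sense in which the decoding is dynamic.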

Paper Structure (23 sections, 9 equations, 3 figures, 9 tables)

Introduction
Related Work
Preliminaries
Definitions
Experimental Setup
Datasets Construction
Evaluation Metrics
Models
Detecting Parametric Knowledge Conflicts
Method
Metric
Analysis
Interpreting parametric knowledge conflicts
Is probability a reliable indicator of answer correctness?
Contrastive metric as indicator of conflicts
...and 8 more sections

Figures (3)

Figure 1: A conflict case of different input modalities with the same information. The conflict still happens even when the visual components recognize the Sydney Opera House.
Figure 2: Relationship of conflicting samples.
Figure 3: Distribution of the contrastive metric on all samples, samples with modality-consistent answers, and samples with modality-conflict answers. The dashed lines indicate the medians.
