Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

Zirui Shao, Feiyu Gao, Zhaoqing Zhu, Chuwei Luo, Hangdi Xing, Zhi Yu, Qi Zheng, Ming Yan, Jiajun Bu

TL;DR

This work identifies Cognition and Perception (C&P) knowledge conflicts in multimodal document understanding: state-of-the-art MLLMs often give cognitive answers (for example, in document VQA) that do not match the content they themselves read via OCR. It formalizes C&P consistency, constructs evaluation samples across five benchmarks, and reveals substantial conflicts even in leading models. The authors propose Multimodal Knowledge Consistency Fine-tuning, comprising C&P Link Tokens and a C&P Connector, to strengthen the linkage between perception and cognition. Across three open-source MLLMs and five datasets, the method reduces C&P conflicts and improves performance on both cognitive and perceptual tasks, indicating tighter integration of the two and better explainability. The work also provides ablations, error analyses, and case studies that examine the sources of residual conflicts and the benefits of cross-verification between perceptual and cognitive knowledge.

Abstract

Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands". Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflict, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.
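To make the consistency figure concrete, the following is a minimal sketch of how a C&P consistency rate could be computed. It is an illustration only: the paper's formal definition is not reproduced in this excerpt, so the exact-match criterion after normalization, and the names `normalize` and `cp_consistency`, are assumptions rather than the authors' metric.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def cp_consistency(cognitive_answers, perceptual_answers):
    """Fraction of samples whose cognitive (VQA) answer matches the perceptual (OCR) answer.

    Assumption: a sample counts as consistent when both strings are equal
    after normalization. The paper may use a different matching rule.
    """
    if len(cognitive_answers) != len(perceptual_answers):
        raise ValueError("both lists must have one entry per sample")
    if not cognitive_answers:
        return 0.0
    matches = sum(
        normalize(c) == normalize(p)
        for c, p in zip(cognitive_answers, perceptual_answers)
    )
    return matches / len(cognitive_answers)
```

Under this sketch, each sample pairs the answer a model gives to a question with the text the same model reads from the corresponding region, and the rate is the share of pairs that agree.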

Paper Structure

This paper contains 29 sections, 5 equations, 6 figures, 20 tables.

Figures (6)

  • Figure 1: GPT-4o generates a VQA (cognition) answer that conflicts with the corresponding visual content identified by its OCR (perception). We refer to these multimodal knowledge conflicts in MLLMs as Cognition and Perception (C&P) knowledge conflicts.
  • Figure 2: a: C&P knowledge conflicts in current MLLMs. "*" denotes the "SFT-baseline" (see Section \ref{sec:models}). Additional quantitative results are provided in Section \ref{app:mllm_inference_prompt} and Table \ref{tab:detailed_main_results}. b: Results of the synthetic noise experiment, with additional details provided in Section \ref{app:noise}. c: The distribution of conflict patterns, including character-level errors (P1), cognitive bias (P2), and limited cognitive ability (P3), with one illustrative example for each.
  • Figure 3: An example illustrates the source data and its corresponding Multimodal Knowledge Consistency Fine-tuning sample. All mathematical symbols in the figure are consistent with those in Section \ref{sec:method}. Corresponding relationships use the same colors for clarity.
  • Figure 4: a: Comparison of the distribution of conflict patterns between InternVL2-2b* and InternVL2-2b (Ours). b: Two cases: b-1 demonstrates the effectiveness of our method, while b-2 reveals a limitation.
  • Figure 5: A specific example illustrates the evaluation sample. All mathematical symbols in the figure are consistent with those in Section \ref{sec:data_construct}. Corresponding relationships are represented using the same colors for clarity.
  • ...and 1 more figure