Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao; Lei Gan; Yuankai Li; Yixin Ye; Dequan Wang

TL;DR

This work introduces the Self-Contradictory Instructions (SCI) benchmark to evaluate large multimodal models on detecting self-contradictory prompts, addressing a key gap in instruction robustness amid expanding context and multimodal inputs. SCI comprises 20,000 conflicts across eight tasks, split evenly between language-language (L-L) and vision-language (V-L) settings, and is built with AutoCreate, an automatic dataset creation framework that iterates seed-driven generator–decorator–cleaner cycles with expert validation. The authors also propose Cognitive Awakening Prompting (CaP), a plug-in prompting approach that injects external cognition to enhance dissonance detection, achieving substantial gains over standard in-context learning across both L-L and V-L tasks. Collectively, SCI, AutoCreate, and CaP offer a scalable platform to study instruction confounds, improve alignment, and foster more reliable human–AI interactions in multimodal contexts.

Abstract

Large multimodal models (LMMs) excel at following human instructions. However, self-contradictory instructions may arise as multimodal interaction becomes more common and context lengths grow, and such instructions are especially challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions (SCI) benchmark to evaluate the capability of LMMs to recognize conflicting commands. It comprises 20,000 conflicts, evenly distributed between the language and vision paradigms. It is constructed with a novel automatic dataset creation framework, which expedites the process and enables us to cover a wide range of instruction forms. Our comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose Cognitive Awakening Prompting, which injects external cognition and substantially improves dissonance detection. The dataset and code are here: https://selfcontradiction.github.io/.

Paper Structure (51 sections, 2 equations, 4 figures, 7 tables)


Introduction
Related Work
Instruction Following
Information Inconsistency
Automatic Dataset Curation
Dataset
AutoCreate
SCI
Language-Language (L-L) Conflict
RuleConflict
AttributeConflict
ExclusionConflict
ForbbidenConflict
Vision-Language (V-L) Conflict
OCRConflict
...and 36 more sections

Figures (4)

Figure 1: Top: Children or language beginners encounter conflicts due to cognitive errors (SemanticConflict). Bottom: Increasing context length leads to contradictions (RuleConflict).
Figure 2: SCI comprises 10,000 language-language (L-L) and 10,000 vision-language (V-L) paradigms, each with 4 tasks. Top: L-L paradigm involves conflicts between context and instruction, such as designed rules, object attributes, exclusive directives, and forbidden words. Bottom: V-L paradigm covers multimodal conflicts, such as OCR images, figures, geometry, and semantics.
Figure 3: We propose AutoCreate, an automatic dataset creation framework that leverages programs and large language models. AutoCreate starts from several task-relevant seeds and maintains a seed pool. During each cycle, AutoCreate includes two branches, the language (left) and the vision (right). Each branch consists of a generator and a decorator. Finally, the cleaner will exclude data that does not meet the standards. The data will be fed into the seed pool for the next round after a quality check by human experts.
Figure 4: CaP greatly improves LMMs' performance on SCI-Core. Chain-of-thought and self-consistency prompting bring limited improvement. Replies are evaluated by human experts for more precise results.
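
The AutoCreate cycle described in the Figure 3 caption (seed pool, generator, decorator, cleaner, then an expert quality check before data re-enters the pool) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Sample`, `auto_create_cycle`, and all `toy_*` functions are hypothetical, only a single branch is modeled (the paper describes separate language and vision branches), and the human expert check is replaced by a placeholder callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one context/instruction pair."""
    context: str
    instruction: str


def auto_create_cycle(
    seed_pool: List[Sample],
    generator: Callable[[Sample], Sample],
    decorator: Callable[[Sample], Sample],
    cleaner: Callable[[Sample], bool],
    expert_check: Callable[[Sample], bool],
) -> Tuple[List[Sample], List[Sample]]:
    """Run one cycle and return (accepted samples, seed pool for the next round)."""
    # Generator proposes a new sample from each seed; decorator post-processes it.
    candidates = [decorator(generator(seed)) for seed in seed_pool]
    # Cleaner excludes data that does not meet the standards.
    kept = [s for s in candidates if cleaner(s)]
    # Stand-in for the human quality check (an assumption, not automated in the paper).
    accepted = [s for s in kept if expert_check(s)]
    # Accepted data is fed back into the seed pool for the next round.
    return accepted, seed_pool + accepted


# Toy stand-ins, purely illustrative.
def toy_generator(seed: Sample) -> Sample:
    return Sample(seed.context, seed.instruction + " Do not follow the context.")


def toy_decorator(sample: Sample) -> Sample:
    return Sample(sample.context.strip(), sample.instruction.strip())


def toy_cleaner(sample: Sample) -> bool:
    return bool(sample.context) and bool(sample.instruction)


def toy_expert_check(sample: Sample) -> bool:
    return True


seeds = [
    Sample("Answer only in French.", "Reply in English."),
    Sample("   ", "Reply in English."),  # blank context, rejected by the cleaner
]
accepted, next_pool = auto_create_cycle(
    seeds, toy_generator, toy_decorator, toy_cleaner, toy_expert_check
)
print(len(accepted), len(next_pool))
```

Passing the four stages in as callables keeps the loop itself independent of whether a stage is a program, a large language model call, or a human reviewer, which matches the caption's description of a framework that combines programs and language models.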
