Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling

Guangmin Zheng; Jin Wang; Xiaobing Zhou; Xuejie Zhang

TL;DR

This work tackles hallucinations in multimodal chain-of-thought reasoning by introducing SNSE-CoT, which uses soft negative sampling and a Bidirectional Margin Loss within a two-stage vision-language framework. Five soft-negative generation methods produce rationales that are textually similar to the original but semantically different, and a contrastive objective pushes these soft negatives away from positives in semantic space even though their text stays close. Empirical results on ScienceQA show SNSE-CoT achieving state-of-the-art performance, especially on image-context questions, and show that loss design and transformation choices matter for reasoning quality. The approach improves the reliability of multimodal CoT and offers a pathway to generalize semantic discrimination in reasoning tasks.

Abstract

Chain of thought (CoT) has proven useful for problems requiring complex reasoning. Many of these problems are both textual and multimodal. Given the inputs in different modalities, a model generates a rationale and then uses it to answer a question. Because of the hallucination issue, a generated rationale can behave like a soft negative, with high textual quality but illogical semantics, and such rationales do not always help improve answer accuracy. This study proposes a rationale generation method using soft negative sampling (SNSE-CoT) to mitigate hallucinations in multimodal CoT. Five methods were applied to generate soft negative samples that shared highly similar text but had different semantics from the original. Bidirectional margin loss (BML) was applied to introduce them into the traditional contrastive learning framework that involves only positive and negative samples. Extensive experiments on the ScienceQA dataset demonstrated the effectiveness of the proposed method. Code and data are released at https://github.com/zgMin/SNSE-CoT.
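To make the idea of a bidirectional margin concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common form of such a loss: the gap between the anchor's similarity to a soft negative and its similarity to a positive is kept inside a band, so the soft negative is neither as close as the positive nor as far as an unrelated sample. The function names, the use of cosine similarity, and the margin values `alpha` and `beta` are all illustrative assumptions; the exact definition is given in the paper and the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Hypothetical bidirectional margin loss (assumed form, alpha <= beta).

    delta is the similarity gap between the soft negative and the positive.
    The loss is zero only when -beta <= delta <= -alpha.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    too_close = max(0.0, delta + alpha)   # soft negative nearly as similar as the positive
    too_far = max(0.0, -delta - beta)     # soft negative pushed away like a plain negative
    return too_close + too_far


# Toy 2-D embeddings, chosen only for illustration.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
inside_band = bidirectional_margin_loss(anchor, positive, [0.8, 0.6])
identical = bidirectional_margin_loss(anchor, positive, [1.0, 0.0])
orthogonal = bidirectional_margin_loss(anchor, positive, [0.0, 1.0])
```

In this toy setup, a soft negative whose similarity gap falls inside the band incurs no loss, while one that is identical to the positive or orthogonal to the anchor is penalized from one side or the other.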

Paper Structure (26 sections, 10 equations, 15 figures, 5 tables)

Introduction
Preliminary
Mitigating Hallucinated Generation
Soft Negative Sampling
Bidirectional Margin Loss
Training Objective
Experiments
Dataset
Implementation Details
Baselines
Comparative Results
Hyperparameter Fine-Tuning
Ablation Studies
Visual Latent Distribution of Samples
Case Analysis
...and 11 more sections

Figures (15)

Figure 1: The latent distribution of the samples. $R^+$ represents positive samples, $R^\#$ represents soft negative samples, and $R^-$ represents negative samples.
Figure 2: The overall architecture of the two-stage model.
Figure 3: Possible situations arising from non-observance of the modification principle. $R$ is the generated rationale.
Figure 4: Hyperparameter fine-tuning.
Figure 5: Visual latent distribution. “ours” represents samples generated by SNSE-CoT, “mm-cot” represents samples generated by Multimodal-CoT, “pos” represents positive samples, and “soft-neg” represents soft negative samples.
...and 10 more figures
