CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, Jiebo Luo

TL;DR

This work tackles the challenge of fine-grained perception and cross-image reasoning in large multimodal models by introducing Contrastive Chain-of-Thought (CoCoT) prompting. CoCoT guides models to compare similarities and differences across multiple image inputs before answering, improving performance on both image-to-image matching and multi-image-to-text matching across open-source and closed-source LMMs. Experiments on Raven-50, Factify-V, and Winoground show consistent gains over DDCoT and CCoT baselines, though gaps to human performance persist. The approach offers a principled way to leverage inter-image contrasts for more accurate and detailed multimodal reasoning, with potential to inform future AGI-oriented systems.

Abstract

In the pursuit of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
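
The compare-then-answer flow described in the abstract can be illustrated with a minimal prompt-building sketch. This is not the paper's exact prompt: the step wording, the `ImageRef` type, and the `build_cocot_prompt` helper are all hypothetical names introduced here for illustration, and the sketch only assembles text without calling any model API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRef:
    """Hypothetical handle for one input image (display label plus path or URL)."""
    label: str
    source: str


def build_cocot_prompt(images: List[ImageRef], question: str) -> str:
    """Assemble a contrastive prompt: compare the images first, then answer.

    The instruction wording below is an illustrative assumption,
    not the prompt used in the paper.
    """
    if len(images) < 2:
        raise ValueError("contrastive prompting needs at least two images")
    labels = ", ".join(img.label for img in images)
    steps = [
        f"You are given {len(images)} images: {labels}.",
        "Step 1: Describe the similarities among these images.",
        "Step 2: Describe the differences among these images.",
        "Step 3: Using the similarities and differences you identified, "
        f"answer the question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    imgs = [ImageRef("Image 1", "a.png"), ImageRef("Image 2", "b.png")]
    print(build_cocot_prompt(imgs, "Which caption matches Image 1?"))
```

In practice the resulting text would be sent to an LMM together with the image data. How the images are attached depends on the model's interface, which this sketch deliberately leaves out.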

Paper Structure (18 sections, 6 figures, 3 tables)

This paper contains 18 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison between different multimodal prompting strategies. The unique components in each prompting strategy's corresponding response are highlighted in varied colors. Note that GPT-4V is used in this example.
  • Figure 2: Different CoT-based methods and their performance in extracting information from images under various conditions, with GPT-4V being used in the experiments. Left: Utilizing CCoT to generate image information; Middle: CoCoT prompting between images with a big domain gap; Right: CoCoT prompting between images with a small domain gap.
  • Figure 3: An example question from the image-to-image matching task, sourced from the Raven-50 dataset (zhang2019raven; huang2023language).
  • Figure 4: Sampled questions from the Raven-50, Factify-V, and Winoground datasets.
  • Figure 5: An example response generated by GPT-4V via CoCoT on the Raven-50 dataset.
  • ...and 1 more figure