Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

Chao Wang, Luning Zhang, Zheng Wang, Yang Zhou

TL;DR

This work tackles the challenge of combinatorial reasoning across multiple perceptual inputs by introducing two benchmarks, CVQA and CPVQA, derived from the Can You Escape? game. It evaluates state-of-the-art multimodal large language models and proposes three plug-and-play baselines (contextual learning, minimum-margin CoT without prompting, and semantic retrieval) to enhance cross-image reasoning. Results reveal that leading models still struggle on these tasks, with maximum CVQA performance around 36.86% and CPVQA around 8.72%, while the proposed methods provide significant gains (up to 22.17% on CVQA and 9.40% on CPVQA) and highlight the importance of robust multimodal integration. The study emphasizes core challenges in multimodal combinatorial reasoning and plans to release its code publicly to spur further advances in robust, interpretable reasoning across complex, multisource scenes.

Abstract

Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the need for combinatorial reasoning across multiple pieces of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, dropping to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.

Paper Structure

This paper contains 19 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Task Example. Illustrates combinatorial reasoning with multiple perceptual inputs in complex scenarios.
  • Figure 2: Solution Formats. The solution formats for different benchmarks in CVQA and CPVQA.
  • Figure 3: Extension Methods. Two examples of extension methods applied to the same scenario. In these examples, (A) indicates expansion without altering the original answer, while (B) indicates expansion that alters the original answer.
  • Figure 4: Extension Method Validity. Original benchmark vs. extended benchmark: the number of differing results across various methods and models for the same task highlights the effectiveness of the extended benchmark, where 'a' represents CVQA and 'b' represents CPVQA. Numbers '1', '2', and '3' correspond to LLM reasoning and MLLM reasoning under the model-inference method (Sec. \ref{subsec:LLMs and MLLMs}) and to semantic retrieval (Sec. \ref{subsec:semantic and visual retrieval}), respectively.
  • Figure 5: Exploration Experiments. Examples guide models to solve complex scenes through reasoning with manual prompts.
  • ...and 3 more figures