Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

Chao Wang; Luning Zhang; Zheng Wang; Yang Zhou

TL;DR

This work tackles combinatorial reasoning across multiple perceptual inputs by introducing two benchmarks, CVQA and CPVQA, derived from the Can You Escape? game. It evaluates state-of-the-art multimodal large language models and proposes three plug-and-play approaches to improve cross-image reasoning: contextual learning (reasoning over the model input), minimum-margin CoT decoding without prompting, and semantic retrieval of relevant visual information. Leading models still struggle on these tasks, with maximum accuracy of about $36.86\%$ on CVQA and about $8.72\%$ on CPVQA. The proposed methods yield gains of up to $22.17\%$ on CVQA and $9.40\%$ on CPVQA over the state-of-the-art closed-source model, which underlines the importance of robust multimodal integration. The study highlights core challenges in multimodal combinatorial reasoning, and the authors state that the code will be made publicly available.

Abstract

Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the need for combinatorial reasoning across multiple pieces of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal that current models perform poorly on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, dropping to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.
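
To make the third approach (retrieving semantically relevant visual information) more concrete, the following is a minimal sketch of generic top-k retrieval by cosine similarity. It is not the authors' implementation: the function names `cosine` and `retrieve_top_k`, the clue identifiers, and the three-dimensional toy vectors are all hypothetical, and the sketch assumes that each visual clue and the question have already been embedded by some encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, clue_vecs, k=2):
    """Return the ids of the k clues most similar to the query, best first."""
    ranked = sorted(
        clue_vecs,
        key=lambda clue_id: cosine(query_vec, clue_vecs[clue_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings, made up for illustration only.
clues = {
    "note_on_desk": [0.9, 0.1, 0.0],
    "painting": [0.0, 1.0, 0.2],
    "keypad": [0.8, 0.0, 0.3],
}
query = [1.0, 0.0, 0.1]

top = retrieve_top_k(query, clues, k=2)
print(top)
```

In a setting like the one described, the retrieved clues would presumably be passed to the model together with the question, so that only the most relevant visual information enters the reasoning step.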

