The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li

TL;DR

This work targets a core aspect of intelligence—association—by proposing an annotation-free method to construct a standardized multi-modal association benchmark for LLMs. It defines an adjective/verb–concept benchmark (object attributes/affordances and actions) and evaluates perception and three levels of memory-augmented association (single-step, synchronous, asynchronous) across open- and closed-source MLLMs and humans. Key findings show a persistent human–MLLM gap, with memory strategies (notably Natural Language Memory) and model ensembling (MoE) providing notable gains but leaving room for substantial improvement, even for GPT-4V and Gemini-1.5-Flash. The benchmark, data refinement pipeline, and analyses illuminate practical directions for building memory-aware, instruction-tuned multi-modal agents capable of robust, multi-step reasoning on unpaired data.

Abstract

Multi-modal Large Language Models (MLLMs) have exhibited impressive capabilities. However, many deficiencies relative to human intelligence have recently been found, $\textit{e.g.}$, hallucination. To advance the study of MLLMs, the community has dedicated effort to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked form of intelligence: $\textbf{association}$, a human's basic capability to link observation and prior practice memory. To comprehensively investigate MLLMs' performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient $\textbf{annotation-free}$ construction method that transforms a general dataset for our association tasks. We also devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into MLLMs' zero-shot association capabilities along multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, and even the current state-of-the-art GPT-4V(vision) shows a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. $\textit{Our data and code are available at:}$ https://mvig-rhos.com/llm_inception.
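
To make the annotation-free idea concrete, the following is a minimal sketch of how single-step association questions might be derived from labels that already exist in a general dataset. It is not the authors' pipeline: the function name `build_single_step_questions`, the `(image_id, concept)` input format, the two-candidate layout, and the `num_memory` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_single_step_questions(samples, num_memory=2, seed=0):
    """Hypothetical helper: turn (image_id, concept) labels into questions.

    Each question holds a few "memory" images that share one concept and
    two candidates, exactly one of which shares that concept as well.
    """
    rng = random.Random(seed)

    # Group image ids by the concept label they already carry.
    by_concept = defaultdict(set)
    for image_id, concept in samples:
        by_concept[concept].add(image_id)
    all_images = sorted({image_id for image_id, _ in samples})

    questions = []
    for concept in sorted(by_concept):
        positives = sorted(by_concept[concept])
        # Distractors are images never labeled with this concept.
        negatives = [i for i in all_images if i not in by_concept[concept]]
        if len(positives) < num_memory + 1 or not negatives:
            continue  # not enough data to form a question

        picked = rng.sample(positives, num_memory + 1)
        memory, positive = picked[:-1], picked[-1]
        candidates = [positive, rng.choice(negatives)]
        rng.shuffle(candidates)
        questions.append({
            "concept": concept,
            "memory": memory,
            "candidates": candidates,
            "answer": candidates.index(positive),
        })
    return questions
```

No new labels are created in this sketch: the existing concept labels are only regrouped, which is the sense in which such a construction is annotation-free. Chained (synchronous or asynchronous) variants would need additional bookkeeping that is not shown here.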

Paper Structure

This paper contains 56 sections, 6 equations, 10 figures, 17 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Our insight and proposed association task. a) Association is always in our lives. b) Our proposed practical association task. c) The performance of current MLLMs and human experts. The results demonstrate a significant gap between current MLLMs and humans in association tasks.
  • Figure 2: Semantic figure of the association question. (Left) Single-step association with a fixed correct memory. (Right) Synchronous and asynchronous association with dynamic memory. Synchronous association involves one category in the chain, while asynchronous association increases complexity with two categories.
  • Figure 3: The average Max $\mid$ Mean step on the individual-concept synchronous association across three open-source MLLMs and humans. Different columns within each MLLM indicate different memory strategies. For detailed results on each category, refer to Tables \ref{tab:detailed_indi_attr_concept}, \ref{tab:detailed_indi_aff_concept}, and \ref{tab:detailed_indi_verb_concept} in the supplementary.
  • Figure 4: The average Max $\mid$ Mean step of asynchronous association with paired categories across different concepts. The upper subfigure shows the mean step, while the lower subfigure shows the max step. For detailed results on each category group, refer to Tables \ref{tab:detailed_ocl_attr_asynchronous_cand}, \ref{tab:detailed_ocl_aff_asynchronous_cand}, and \ref{tab:detailed_pangea_verb_asynchronous_cand} in the supplementary.
  • Figure 5: Failure cases of GPT-4V in asynchronous association with paired attribute categories. (Left) The error arises from the deduction. (Right) The error originates from the association.
  • ...and 5 more figures