Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Mustafa Shukor; Alexandre Rame; Corentin Dancette; Matthieu Cord

TL;DR

The paper tackles the gap between task-performance metrics and genuine alignment of large multimodal models by evaluating 10 open-source LMMs across five axes: object hallucination, abstention, compositionality, explainability, and instruction following. It investigates training-free multimodal in-context learning (ICL) and introduces three variants (Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL) to address these flaws without fine-tuning. Key findings: scaling alone does not eliminate the flaws; ICL improves explainability and abstention, does not improve compositionality, and can even amplify hallucinations; the proposed X-ICL variants are promising post-hoc remedies for some of these flaws. The work highlights both the potential and the limits of post-hoc, training-free alignment and releases public code for further exploration and reproducibility.

Abstract

Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, their limitations, and to what extent such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced: despite its effectiveness for improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.

Paper Structure (59 sections, 7 equations, 19 figures, 11 tables)

Introduction
LMMs evaluation and multimodal ICL
Background on LMMs and ICL.
Hallucination
Abstention
Compositionality
Explainability
Instruction following
Rectifying the flaws of LMMs with multimodal ICL (X-ICL)
Related Work
Discussion
Discussion
Other limitations and evaluation axes.
ICL as a way to address foundation model limitations.
Other LMMs and foundation models.
...and 44 more sections

Figures (19)

Figure 1: Evaluation framework. We study LMMs following 3 strategies, on different axes: hallucinations, abstention, compositionality, explainability and instruction following. In addition to an image <image> and a question T used in zero-shot (ZS), in-context learning (ICL) considers $N$ demonstrations of image-question-answer triplets ($\texttt{<image>}_i, T_i, R_i$) as input $X$, augmented by a function $f$ in our X-ICL.
Figure 2: Evaluation of LMMs on object hallucination (OH, left) and abstention (right). $\Delta$ refers to zero-shot and the $\star$ size refers to the number of shots in ICL.
Figure 3: Compositionality. Models are evaluated on CREPE and SugarCREPE with the image-text matching (ITM) task.
Figure 4: Explainability. Models are asked to generate an explanation for image, question and answer triplets from the VQA-X dataset.
Figure 5: Instruction following. Evaluation on the LLaVA benchmark on 3 types of instructions: detailed descriptions, complex questions and conversations. Left: example with OFv2-9B. Right: average scores (over 3 instruction types) given by GPT-4. Detailed scores for each type are given in the appendix.
...and 14 more figures
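
The input construction described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Demo` class, the `build_icl_input` function, the `augment` callback (standing in for the function $f$), the literal `<image>` placeholder and the newline-separated prompt layout are all hypothetical names and choices introduced here. Zero-shot corresponds to passing no demonstrations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: a literal placeholder marks where each image goes in the text.
# Real models define their own image token and interleaving format.
IMAGE_TOKEN = "<image>"


@dataclass
class Demo:
    """One in-context demonstration (<image>_i, T_i, R_i)."""
    image_id: str   # hypothetical handle to an image
    question: str   # T_i
    response: str   # R_i


def build_icl_input(
    demos: List[Demo],
    query_image_id: str,
    query_question: str,
    augment: Optional[Callable[[Demo], Demo]] = None,
) -> Tuple[List[str], str]:
    """Build the interleaved input X from N demonstrations plus the query.

    `augment` plays the role of f in X-ICL: it rewrites each demonstration
    before it is placed in the context. With augment=None this is plain ICL,
    and with an empty `demos` list it reduces to zero-shot.
    """
    if augment is not None:
        demos = [augment(d) for d in demos]

    # Each demonstration contributes one image and one "question response" line.
    parts = [f"{IMAGE_TOKEN}{d.question} {d.response}" for d in demos]
    # The query has an image and a question but no response: the model completes it.
    parts.append(f"{IMAGE_TOKEN}{query_question}")

    image_ids = [d.image_id for d in demos] + [query_image_id]
    return image_ids, "\n".join(parts)


if __name__ == "__main__":
    demos = [
        Demo("img1", "Q: What is this? A:", "a cat"),
        Demo("img2", "Q: What color is it? A:", "red"),
    ]
    ids, prompt = build_icl_input(demos, "img3", "Q: How many are there? A:")
    print(ids)
    print(prompt)
```

Keeping `augment` as a separate callback means the same builder covers zero-shot, plain ICL and an augmented variant, so the three strategies can be compared on identical demonstrations.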
