FLIP Reasoning Challenge

Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer

TL;DR

The FLIP Reasoning Challenge introduces a multimodal benchmark derived from Idena where models must choose the coherent ordering of four images to form a meaningful story, testing sequential reasoning and visual storytelling. By comparing raw image inputs to caption-derived representations, the study shows caption-based reasoning markedly improves performance, with open-source ensembles reaching 85.2% accuracy, though human performance remains at 95.3%. The results underscore the limitations of current multimodal reasoning systems, while highlighting the value of captioning, task reframing, and ensemble methods to bridge the gap toward human-level reasoning. FLIP provides a transparent ground-truth framework and a scalable benchmark to drive progress in robust multimodal reasoning research with practical implications for AI evaluation and alignment.

Abstract

Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly: 75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
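
The abstract does not state how the predictions of the 15 models are combined. As an illustration only, the sketch below assumes simple majority voting over per-model "left"/"right" labels; the function names `majority_vote` and `ensemble_accuracy` and the toy data are hypothetical and are not taken from the paper or its codebase.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among one flip's per-model predictions.

    Ties are broken alphabetically so the result is deterministic; with an
    odd number of models and two labels, ties cannot occur.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    return min(counts, key=lambda label: (-counts[label], label))


def ensemble_accuracy(per_model_predictions, ground_truth):
    """Accuracy of the majority-vote ensemble.

    per_model_predictions: one list of labels per model, each aligned
    index-by-index with ground_truth.
    """
    correct = 0
    for i, truth in enumerate(ground_truth):
        votes = [model[i] for model in per_model_predictions]
        if majority_vote(votes) == truth:
            correct += 1
    return correct / len(ground_truth)


# Toy data (assumed, not from the paper): three models, four flips.
truth = ["left", "right", "right", "left"]
model_a = ["left", "right", "left", "left"]
model_b = ["left", "left", "right", "left"]
model_c = ["right", "right", "right", "right"]

print(ensemble_accuracy([model_a, model_b, model_c], truth))
```

In this toy example each model errs on different flips, so the vote corrects every individual mistake. This is the general mechanism by which an ensemble can exceed its best member, provided the models' errors are not strongly correlated.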

Paper Structure

This paper contains 43 sections, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Example of a Flip challenge from the Idena blockchain. The user is given 4 images presented in two different orderings (also referred to as stacks) and must select the stack that tells a meaningful story. In this example, the right stack tells the story of taking flour, mixing it with other ingredients, frying the dough, and then getting pancakes. Since this is a coherent story, "right" is the correct answer.
  • Figure 2: The image for which the captions in the caption examples table were generated.
  • Figure 3: Number of words in the captions provided by ViPLlava 13B and BLIP2 Flan T5 XXL. We see that the former produces much longer captions than the latter.
  • Figure 4: Distribution of how many flips are correctly labeled by a given number of models. For instance, the 0 column (the leftmost column) shows how many flips all considered models label incorrectly. The three subfigures show the distribution for the three collections of top-performing models: only open-sourced models, only closed-sourced models, and all models. Two examples of challenges that all open-sourced models fail on are shown in the figure with examples of misclassified flips.
  • Figure 5: Correlation between model predictions for all 33 models we consider. The first 27 are the open-sourced models, while the last 6 are the closed-sourced models.
  • ...and 4 more figures