FLIP Reasoning Challenge
Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer
TL;DR
The FLIP Reasoning Challenge introduces a multimodal benchmark derived from human verification tasks on the Idena blockchain, in which models must choose which of two orderings of four images forms a coherent story, testing sequential reasoning and visual storytelling. By comparing raw image inputs to caption-derived representations, the study shows that caption-based reasoning markedly improves performance, with open-source ensembles reaching $85.2\%$ accuracy, though human performance remains higher at $95.3\%$. The results underscore the limitations of current multimodal reasoning systems and highlight the value of captioning, task reframing, and ensemble methods in narrowing the gap to human-level reasoning. FLIP provides a transparent ground-truth framework and a scalable benchmark to drive progress in robust multimodal reasoning research, with practical implications for AI evaluation and alignment.
Abstract
In recent years, advances in artificial intelligence (AI) have shown that AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of four images and require them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models reach accuracies of only 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of the images, which yields better results than using the raw images directly: 75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
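To make the ensemble idea concrete, the following is a minimal sketch of one way to combine per-model predictions on the two-option FLIP task. The abstract does not specify the combination rule, so plain majority voting is an assumption here, and the labels "left" and "right" as well as the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label among the models' answers for one challenge.

    Assumption: simple majority voting; the paper's actual ensemble rule may differ.
    With two options and an odd number of models, no ties can occur.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


def ensemble_accuracy(
    per_model_predictions: Sequence[Sequence[str]],
    ground_truth: Sequence[str],
) -> float:
    """Accuracy of the majority-vote ensemble.

    per_model_predictions[m][i] is model m's answer for challenge i.
    """
    # Transpose so that each tuple holds all models' answers for one challenge.
    per_challenge = list(zip(*per_model_predictions))
    if len(per_challenge) != len(ground_truth):
        raise ValueError("prediction and ground-truth lengths differ")
    correct = sum(
        majority_vote(votes) == truth
        for votes, truth in zip(per_challenge, ground_truth)
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    # Toy data (hypothetical): three models, four challenges.
    preds = [
        ["left", "right", "left", "right"],
        ["left", "left", "left", "right"],
        ["right", "right", "left", "left"],
    ]
    truth = ["left", "right", "left", "left"]
    print(ensemble_accuracy(preds, truth))
```

An odd number of voters, such as the 15 models mentioned above, guarantees a strict majority on a binary choice, which is one reason majority voting is a natural baseline for this task format.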
