All in an Aggregated Image for In-Image Learning
Lei Wang, Wanyu Xu, Zhiqiang Hu, Yihuai Lan, Shan Dong, Hao Wang, Roy Ka-Wei Lee, Ee-Peng Lim
TL;DR
This work introduces In-Image Learning (I$^2$L), a method that aggregates demonstrations, visual cues, and chain-of-thought reasoning into a single image to improve multimodal reasoning with GPT-4V. It further proposes I$^2$L-Hybrid, which uses a GPT-4V-based selector to choose between I$^2$L and VT-ICL for each instance. On MathVista, I$^2$L and I$^2$L-Hybrid achieve competitive or state-of-the-art results, with I$^2$L-Hybrid reaching the best average accuracy (approximately $52.8\%$); ablation studies confirm the value of visual cues and CoT content. The approach reduces input burden, preserves rich visual information, and provides a scalable path toward stronger multimodal reasoning in large vision-language models. Code is publicly released.
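The per-instance selection described above can be pictured as a simple dispatch. The following is only a minimal sketch under stated assumptions: `hybrid_answer`, the `Instance` type, and the three callables are hypothetical names introduced for illustration, and the selector here is a stand-in stub, whereas the work uses a GPT-4V-based selector.

```python
from typing import Callable, Dict, Tuple

# Hypothetical instance representation: a dict of string fields.
Instance = Dict[str, str]


def hybrid_answer(
    instance: Instance,
    selector: Callable[[Instance], str],
    i2l_solver: Callable[[Instance], str],
    vt_icl_solver: Callable[[Instance], str],
) -> Tuple[str, str]:
    """Route one instance to I2L or VT-ICL based on the selector's choice.

    Returns (method_name, answer). The selector is assumed to return
    the string "I2L" or "VT-ICL"; anything else falls back to VT-ICL.
    """
    choice = selector(instance)
    if choice == "I2L":
        return "I2L", i2l_solver(instance)
    return "VT-ICL", vt_icl_solver(instance)


# Stub components for illustration only (not the GPT-4V-based selector).
def stub_selector(instance: Instance) -> str:
    # Assumed placeholder rule: image-heavy instances go to I2L.
    return "I2L" if instance.get("kind") == "image_heavy" else "VT-ICL"


def stub_i2l(instance: Instance) -> str:
    return "answer from aggregated-image prompt"


def stub_vt_icl(instance: Instance) -> str:
    return "answer from interleaved prompt"
```

In practice each stub would be replaced by a call to the underlying model; the dispatch logic itself stays the same.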
Abstract
This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into an aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) in multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into language models, I$^2$L consolidates all information into an aggregated image and relies on the model's image processing, understanding, and reasoning abilities. This has several advantages: it reduces reliance on inaccurate textual descriptions of complex images, provides flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods. Specifically, it uses an automatic strategy to select the most suitable method (I$^2$L or another ICL method) for each task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations in the aggregated image affect the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
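To make the idea of an aggregated image concrete, the sketch below computes where demonstration panels and the test query could be placed on one canvas. It is an illustrative assumption, not the authors' implementation: the function name `grid_layout`, the uniform cell size, and the row-major ordering are all hypothetical choices, and actually pasting images would need an image library, which is omitted here.

```python
import math
from typing import List, Tuple


def grid_layout(
    n_cells: int, cols: int, cell_w: int, cell_h: int, pad: int = 10
) -> Tuple[Tuple[int, int], List[Tuple[int, int]]]:
    """Compute canvas size and top-left corners for a padded grid.

    Cells are filled in row-major order, so the order of the inputs
    determines each demonstration's position in the aggregated image.
    """
    if n_cells <= 0 or cols <= 0:
        raise ValueError("n_cells and cols must be positive")
    rows = math.ceil(n_cells / cols)
    # Padding surrounds every cell, including the outer border.
    canvas_w = cols * cell_w + (cols + 1) * pad
    canvas_h = rows * cell_h + (rows + 1) * pad
    corners = []
    for i in range(n_cells):
        r, c = divmod(i, cols)
        corners.append((pad + c * (cell_w + pad), pad + r * (cell_h + pad)))
    return (canvas_w, canvas_h), corners


# Example: two demonstrations plus one test query, two columns.
canvas, corners = grid_layout(n_cells=3, cols=2, cell_w=100, cell_h=80)
```

Varying `cell_w` and `cell_h`, `n_cells`, and the input order in such a layout corresponds loosely to the three factors studied: resolution, number of demonstrations, and demonstration position.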
