All in an Aggregated Image for In-Image Learning

Lei Wang, Wanyu Xu, Zhiqiang Hu, Yihuai Lan, Shan Dong, Hao Wang, Roy Ka-Wei Lee, Ee-Peng Lim

TL;DR

This work introduces In-Image Learning (I$^2$L), a method that aggregates demonstrations, visual cues, and chain-of-thought (CoT) reasoning into a single image to improve multimodal reasoning with GPT-4V. It further proposes I$^2$L-Hybrid, which uses a GPT-4V-based selector to choose, for each instance, between I$^2$L and visual-text interleaved in-context learning (VT-ICL). On MathVista, I$^2$L and I$^2$L-Hybrid achieve competitive or state-of-the-art results, with I$^2$L-Hybrid reaching the best average accuracy (about 52.8%), and ablation studies support the value of the visual cues and CoT content. The approach avoids multiple input images and lengthy prompts, preserves visual information that text descriptions can lose, and offers a path toward stronger multimodal reasoning in large vision-language models. The code is publicly released.

Abstract

This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into an aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) in multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into language models, I$^2$L consolidates all information into an aggregated image and leverages the model's image processing, understanding, and reasoning abilities. This has several advantages: it reduces inaccurate textual descriptions of complex images, provides flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods. Specifically, it uses an automatic strategy to select the most suitable method (I$^2$L or another ICL method) for a specific task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations in the aggregated image influence the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
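
The per-instance selection in I$^2$L-Hybrid can be pictured as a thresholded routing step. The following is only a minimal sketch under stated assumptions: the names `choose_method` and `route`, the idea that the selector yields a single numeric score, and the direction of the comparison (higher score favors I$^2$L) are all hypothetical and are not taken from the paper or its released code.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def choose_method(score: float, threshold: float) -> str:
    """Pick an ICL method for one instance from a selector score.

    Assumption (not from the paper): a score at or above the threshold
    routes the instance to I2L, anything below goes to VT-ICL.
    """
    return "I2L" if score >= threshold else "VT-ICL"


def route(instances: List[T], score_fn: Callable[[T], float], threshold: float) -> List[str]:
    """Apply the selector to every instance.

    `score_fn` is a hypothetical stand-in for the GPT-4V-based selector;
    here it is any callable that maps an instance to a float.
    """
    return [choose_method(score_fn(instance), threshold) for instance in instances]


# Toy usage with made-up scores standing in for selector outputs.
toy_scores = {"q1": 0.9, "q2": 0.2, "q3": 0.5}
decisions = route(list(toy_scores), toy_scores.get, threshold=0.5)
```

With the toy scores above, `q1` and `q3` would go to I$^2$L and `q2` to VT-ICL. Sweeping `threshold` in such a sketch mirrors the kind of threshold analysis the paper reports for I$^2$L-Hybrid.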

Paper Structure (22 sections, 6 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: (a) Text-only in-context learning (T-ICL). (b) T-ICL with additional image-to-text models (T-ICL-Img). (c) Visual-text interleaved in-context learning (VT-ICL). (d) In-image learning (I$^2$L). For I$^2$L, we combine the demonstrations (input image, visual cues, input text, output chain-of-thought reasoning, and output answer) and the test query (input image and input text) into an aggregated image. We then feed this aggregated image into LMMs to obtain the answer for the test query.
  • Figure 2: Overview of I$^2$L-Hybrid.
  • Figure 3: In-depth analysis of I$^2$L: (a) Impact of the relative position of demonstrations and test examples in an aggregated image. T2B ("Top to Bottom") arranges the examples in sequence from top to bottom; B2T, L2R, and R2L arrange them from bottom to top, left to right, and right to left, respectively. (b) Impact of resolution ratio. (c) Impact of the number of demonstrations. (d) Impact of thresholds for I$^2$L-Hybrid.
  • Figure 4: A case of the yes-or-no task. (a) Input with image demonstrations and in-context learning from demonstrations to solve the test question. (b) Input with image demonstrations and learning from demonstrations to solve the test question.
  • Figure 5: A case of the yes-or-no task. (a) Input with text-only in-context learning with additional image-to-text models to solve the test question. (b) Input with image demonstrations and in-context learning from demonstrations to solve the test question.
  • ...and 8 more figures
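
The four arrangement orders named in the Figure 3 caption (T2B, B2T, L2R, R2L) can be illustrated with a small layout computation. This is a minimal sketch and not the authors' implementation: the function name `layout_panels`, the `gap` parameter, and the choice to align panels at the top-left edge are assumptions made for illustration. It only computes where each panel would be pasted on the aggregated canvas; actual pasting would need an imaging library.

```python
from typing import List, Optional, Tuple

Size = Tuple[int, int]           # (width, height) of one panel
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) on the canvas


def layout_panels(sizes: List[Size], order: str = "T2B", gap: int = 0) -> Tuple[Size, List[Box]]:
    """Compute the canvas size and one box per panel for a given order.

    Panels are given in sequence (demonstrations first, test query last).
    T2B/L2R place the first panel at the top/left; B2T/R2L place it at the
    bottom/right. Boxes are returned in the original panel order.
    """
    if order not in {"T2B", "B2T", "L2R", "R2L"}:
        raise ValueError(f"unknown order: {order}")
    vertical = order in {"T2B", "B2T"}
    indexed = list(enumerate(sizes))
    if order in {"B2T", "R2L"}:
        indexed.reverse()  # the last panel is placed first (top or left)

    boxes: List[Optional[Box]] = [None] * len(sizes)
    offset = 0
    for index, (width, height) in indexed:
        if vertical:
            boxes[index] = (0, offset, width, offset + height)
            offset += height + gap
        else:
            boxes[index] = (offset, 0, offset + width, height)
            offset += width + gap

    extent = max(offset - gap, 0)  # drop the trailing gap
    if vertical:
        canvas = (max((w for w, _ in sizes), default=0), extent)
    else:
        canvas = (extent, max((h for _, h in sizes), default=0))
    return canvas, [box for box in boxes if box is not None]


# Toy usage: one demonstration panel followed by one test-query panel.
canvas_t2b, boxes_t2b = layout_panels([(100, 50), (80, 60)], order="T2B")
canvas_b2t, boxes_b2t = layout_panels([(100, 50), (80, 60)], order="B2T")
```

In the toy call, T2B puts the first panel at the top of a 100 x 110 canvas, while B2T puts the same panel at the bottom; the horizontal orders behave the same way along the x-axis.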