Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

Grant Wardle, Teo Susnjak

TL;DR

This paper investigates how the sequencing of image and text inputs in multimodal prompts affects reasoning in large language models. It conducts zero-shot experiments with three commercial LLMs on two exam-style benchmarks (M3Exam and M3COTS), comparing image-first, text-first, and interleaved prompt structures, and analyzes how task complexity and prompt attributes modulate sensitivity. Key contributions include identifying context-dependent sequencing effects, inferring underlying fusion strategies (early vs late vs hybrid), and proposing practical guidelines for multi-modal prompt design. The findings have practical implications for education, medical imaging, and cross-modal reasoning, where prompt structure and fusion behavior critically shape performance. The work emphasizes that physical information order generally matters more than priming, and that solving multi-hop reasoning requires maintaining context across steps, guiding future research on positional encoding and cross-modal integration in transformers.

Abstract

This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs). We performed empirical evaluations using three commercial LLMs. Our results demonstrate that the order in which modalities are presented can significantly affect performance, particularly in tasks of varying complexity. For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy. However, in more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task. Our findings also highlight the importance of question/prompt structure. In nested and multi-step reasoning tasks, modality sequencing played a key role in shaping model performance. While LLMs excelled in the initial stages of reasoning, they struggled to re-incorporate earlier information, underscoring the challenges of multi-hop reasoning within transformer architectures. This suggests that aligning the sequence of modalities with the logical flow of reasoning steps is more critical than modality order alone. These insights offer valuable implications for improving multi-modal prompt design, with broader applications across fields such as education, medical imaging, and cross-modal learning.
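
To make the idea of modality sequencing concrete, the following is a minimal sketch of how the same question could be assembled in image-first, text-first, and interleaved order. It is an illustration only, not the authors' code or any vendor's API: the function name `build_parts`, the `{"type": ...}` part dictionaries, the file name, and the choice to place the image between the question stem and the answer options in the interleaved case are all assumptions made for this example.

```python
from typing import Dict, List

# A prompt "part" is a plain dictionary here. This shape is hypothetical and
# would need adapting to whichever multi-modal API is actually used.
Part = Dict[str, str]


def build_parts(stem: str, options: str, image_ref: str, order: str) -> List[Part]:
    """Return the parts of one question in the requested modality order."""
    image = {"type": "image", "ref": image_ref}
    stem_part = {"type": "text", "text": stem}
    options_part = {"type": "text", "text": options}

    if order == "image_first":
        return [image, stem_part, options_part]
    if order == "text_first":
        return [stem_part, options_part, image]
    if order == "interleaved":
        # Assumption: the image sits between the stem and the answer options.
        return [stem_part, image, options_part]
    raise ValueError(f"unknown order: {order!r}")


if __name__ == "__main__":
    for order in ("image_first", "text_first", "interleaved"):
        parts = build_parts(
            "Which shape is shown?",      # illustrative question stem
            "A) circle  B) square",       # illustrative answer options
            "question_5.png",             # hypothetical image reference
            order,
        )
        print(order, [part["type"] for part in parts])
```

Keeping the content identical and varying only the order of the parts is what allows a difference in accuracy to be attributed to sequencing rather than to wording.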

Paper Structure (32 sections, 1 equation, 11 figures, 9 tables)

This paper contains 32 sections, 1 equation, 11 figures, 9 tables.

Figures (11)

  • Figure 1: M3Exam example question 5
  • Figure 2: Set of images from the M3Exam dataset showing a complex set of image arrangements.
  • Figure 3: An example of a reconstructed M3Exam question.
  • Figure 4: Typical text/image layouts across M3COTS dataset questions with the image first.
  • Figure 5: Example of the structure of the API calls containing the prompts for different experimental configurations.
  • ...and 6 more figures