The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance

Anwesha Mohanty, Venkatesh Balavadhani Parthasarathy, Arsalan Shahid

TL;DR

This work systematically evaluates prompting strategies for Multimodal Large Language Models (MLLMs) across 24 tasks using 13 open-source models, stratified by size. A framework of four evaluation aspects (Reasoning, Multimodal Understanding, Code Generation, Knowledge Retrieval) is paired with seven prompting methods (Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, Tree-of-Thought) to assess performance and resource use under fixed inference conditions. Key findings show that Large MLLMs excel at structured outputs such as code generation (up to 96.88% accuracy with Few-Shot) and achieve high relevance in multimodal understanding, but all models struggle with complex reasoning and exhibit hallucinations, especially smaller models given structured prompts. No single prompting method optimizes all tasks; adaptive strategies that combine example guidance with selective structured reasoning yield better robustness, efficiency, and factual accuracy, offering practical guidance for deploying MLLMs in AI-assisted coding, knowledge retrieval, and multimodal content understanding.

Abstract

Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (<4B), Medium (4B-10B), and Large (>10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.
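The size stratification described in the abstract can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the function name `size_tier`, the `SizeTier` enum, and the example model names and parameter counts are all hypothetical, and treating exactly 4B and exactly 10B as Medium is an assumption based on the "4B-10B" range.

```python
from enum import Enum


class SizeTier(Enum):
    """Size categories as named in the abstract."""
    SMALL = "Small (<4B)"
    MEDIUM = "Medium (4B-10B)"
    LARGE = "Large (>10B)"


def size_tier(params_billion: float) -> SizeTier:
    """Map a parameter count (in billions) to a size tier.

    Assumption: the boundaries 4B and 10B themselves count as Medium.
    """
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    if params_billion < 4:
        return SizeTier.SMALL
    if params_billion <= 10:
        return SizeTier.MEDIUM
    return SizeTier.LARGE


# Hypothetical models and sizes, not the 13 models evaluated in the paper.
example_models = {"model-a": 2.0, "model-b": 7.0, "model-c": 34.0}

for name, size in example_models.items():
    print(f"{name}: {size_tier(size).value}")
```

Grouping results by tier before comparing prompting methods keeps a comparison such as Few-Shot versus Chain-of-Thought from being confounded by model scale.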

Paper Structure

This paper contains 80 sections, 31 figures, 24 tables.

Figures (31)

  • Figure 1: A high-level overview of a typical MLLM pipeline. Multiple input modalities (e.g., images, video, audio) are first processed by dedicated modality encoders (e.g., ViT [dosovitskiy2020image], CLIP-ViT [radford2021learning], BEiT [bao2021beit]). The encoded features are then projected or transformed via components such as linear projections, MLPs, or cross-attention to align with the text embedding space. Finally, the LLM backbone (e.g., Qwen, LLaMA, Falcon) integrates these multimodal features for unified reasoning and generation.
  • Figure 2: Zero-shot Prompting Syntax
  • Figure 3: One-shot Prompting Syntax
  • Figure 4: Few-shot Prompting Syntax
  • Figure 5: Chain-of-Thought (CoT) Prompting Syntax
  • ...and 26 more figures
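
Figures 2 to 5 concern the prompt syntax for Zero-Shot, One-Shot, Few-Shot, and Chain-of-Thought prompting. The figures themselves are not reproduced here, so the following is only a generic sketch of how these four styles are commonly assembled as text. The `Q:`/`A:` layout, the function name `build_prompt`, and the step-by-step cue are assumptions, not the paper's exact templates, and image inputs are omitted.

```python
from typing import Sequence, Tuple

# A demonstration is a (question, answer) pair; the layout is an assumption.
Example = Tuple[str, str]


def build_prompt(
    question: str,
    examples: Sequence[Example] = (),
    chain_of_thought: bool = False,
) -> str:
    """Assemble a text prompt.

    - no examples       -> zero-shot
    - one example       -> one-shot
    - several examples  -> few-shot
    - chain_of_thought  -> appends a generic step-by-step reasoning cue
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    answer_cue = "A: Let's think step by step." if chain_of_thought else "A:"
    parts.append(f"Q: {question}\n{answer_cue}")
    return "\n\n".join(parts)


# Hypothetical demonstrations for illustration only.
demos = [
    ("What is 2 + 3?", "5"),
    ("What is 7 - 4?", "3"),
]

zero_shot = build_prompt("What is 6 + 1?")
one_shot = build_prompt("What is 6 + 1?", examples=demos[:1])
few_shot = build_prompt("What is 6 + 1?", examples=demos)
cot = build_prompt("What is 6 + 1?", chain_of_thought=True)

print(few_shot)
```

Keeping the example count and the reasoning cue as independent switches makes it straightforward to combine example-based guidance with selective structured reasoning per task, in the spirit of the adaptive strategy the paper argues for.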