
Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge

Bo Pan, Xuan Kan, Kaitai Zhang, Yan Yan, Shunwen Tan, Zihao He, Zixin Ding, Junjie Wu, Liang Zhao

TL;DR

BLPO aligns multimodal LLM judges with human judgments by jointly optimizing the judge prompt and a learnable image-to-text (I2T) prompt. Converting images to text lets more examples fit within the judge's context window during prompt refinement. In its bi-level framework, an inner loop refines the image verbalizations and an outer loop updates the evaluation prompt. Across four image-centric datasets and three LLM judges, BLPO reports better alignment with human judgments and more stable convergence than existing auto prompt optimization (APO) baselines. Because it tunes prompts rather than model weights, the approach reduces the need for task-specific retraining.

Abstract

Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. While supervised fine-tuning on human-labeled data can improve alignment, it is costly and inflexible, requiring new training for each task or dataset. Recent progress in auto prompt optimization (APO) offers a more efficient alternative by automatically improving the instructions that guide LLM judges. However, existing APO methods primarily target text-only evaluation, and APO remains underexplored in multimodal settings. In this work, we study auto prompt optimization for multimodal LLM-as-a-judge, particularly for evaluating AI-generated images. We identify a key bottleneck: multimodal models can only process a limited number of visual examples due to context window constraints, which hinders effective trial-and-error prompt refinement. To overcome this, we propose BLPO, a bi-level prompt optimization framework that converts images into textual representations while preserving evaluation-relevant visual cues. Our bi-level optimization approach jointly refines the judge prompt and the image-to-text (I2T) prompt to maintain fidelity under limited context budgets. Experiments on four datasets and three LLM judges demonstrate the effectiveness of our method.
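
The alternating structure described above (an inner loop over the I2T prompt, an outer loop over the judge prompt) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes a simple greedy search that keeps whichever candidate prompt agrees best with human labels, and every name in it (`agreement`, `bilevel_search`, the `toy_*` stubs, the proposal functions) is hypothetical. The stubs stand in for real MLLM calls so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Image = Dict[str, bool]            # toy stand-in for an image
Example = Tuple[Image, int]        # (image, human label)


def agreement(judge_prompt: str, i2t_prompt: str, data: List[Example],
              verbalize: Callable[[str, Image], str],
              judge: Callable[[str, str], int]) -> float:
    """Fraction of examples where the judge's score equals the human label."""
    hits = sum(int(judge(judge_prompt, verbalize(i2t_prompt, img)) == label)
               for img, label in data)
    return hits / len(data)


def bilevel_search(judge_prompt, i2t_prompt, data, verbalize, judge,
                   propose_judge, propose_i2t, outer_steps=2, inner_steps=2):
    """Greedy alternating search (an assumed simplification, not BLPO itself)."""
    def score(jp, ip):
        return agreement(jp, ip, data, verbalize, judge)

    for _ in range(outer_steps):
        # Inner level: refine the I2T prompt with the judge prompt held fixed.
        for _ in range(inner_steps):
            candidates = [i2t_prompt] + propose_i2t(i2t_prompt)
            i2t_prompt = max(candidates, key=lambda ip: score(judge_prompt, ip))
        # Outer level: update the judge prompt given the refined I2T prompt.
        candidates = [judge_prompt] + propose_judge(judge_prompt)
        judge_prompt = max(candidates, key=lambda jp: score(jp, i2t_prompt))
    return judge_prompt, i2t_prompt, score(judge_prompt, i2t_prompt)


# --- Toy stubs replacing real model calls (all hypothetical) ---------------
def toy_verbalize(i2t_prompt: str, image: Image) -> str:
    """Describe only the attributes that the I2T prompt asks for."""
    parts = []
    if "color" in i2t_prompt:
        parts.append("colorful" if image["colorful"] else "dull")
    if "sharpness" in i2t_prompt:
        parts.append("sharp" if image["sharp"] else "blurry")
    return ", ".join(parts)


def toy_judge(judge_prompt: str, text: str) -> int:
    """Score 1 or 0 from the text description alone."""
    if "sharpness" in judge_prompt:
        return int("sharp" in text)
    return int("sharp" in text or "colorful" in text)


def toy_propose_i2t(prompt: str) -> List[str]:
    return [] if "sharpness" in prompt else [prompt + " Describe sharpness."]


def toy_propose_judge(prompt: str) -> List[str]:
    return ["Rate by sharpness."]


# In this toy data, human raters reward sharpness only.
images = [{"sharp": True, "colorful": False}, {"sharp": False, "colorful": True},
          {"sharp": True, "colorful": True}, {"sharp": False, "colorful": False}]
data = [(img, int(img["sharp"])) for img in images]

initial = agreement("Rate overall quality.", "Describe color.", data,
                    toy_verbalize, toy_judge)
best_judge, best_i2t, best = bilevel_search(
    "Rate overall quality.", "Describe color.", data,
    toy_verbalize, toy_judge, toy_propose_judge, toy_propose_i2t)
```

In this toy setting the judge prompt cannot improve until the verbalization mentions sharpness, which is why the two prompts are searched together rather than one at a time. A real system would replace the stubs with model calls and the greedy selection with whatever update rule the paper specifies.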

Paper Structure

This paper contains 25 sections, 11 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: The long-context issue that MLLMs face in prompt optimization.
  • Figure 2: Illustration of the framework. Left: MLLM-as-a-Judge for image evaluation. Right: Our proposed BLPO framework for optimizing the prompt for MLLM-as-a-Judge. Image examples are drawn from the ImageReward dataset.
  • Figure 3: Optimization curves with the Llama4-Maverick backbone on four datasets.
  • Figure 4: Comparison of model performance on ImageReward (top row) and UnsafeBench (bottom row) under varying (a,d) batch sizes, (b,e) inner-level steps, and (c,f) outer-level steps.