Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

TL;DR

Flex-Judge introduces a reasoning-guided multimodal evaluator trained solely on a small corpus of text-only rationales. It demonstrates zero-shot generalization across image, video, audio, and molecular modalities without modality-specific supervision, matching or exceeding state-of-the-art commercial APIs and open-source judges. The approach relies on a $1K$-sample textual seed to fine-tune a vision-language model, with inference-time scaling strategies like majority voting to boost performance. The results highlight reasoning supervision as a scalable, cost-effective path for robust multimodal evaluation and downstream training via DPO, including a molecular case study with Flex-Mol-LLaMA.
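The majority-voting idea mentioned above can be illustrated with a minimal sketch. It assumes the judge is wrapped in a hypothetical callable, `sample_verdict`, that returns one verdict per stochastic sample (for example `"A"`, `"B"`, or `"tie"` in a pairwise comparison). The function names, the default number of samples, and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(verdicts: Sequence[Hashable]) -> Hashable:
    """Return the most frequent verdict.

    Ties are broken in favor of the verdict that appears first in the
    sequence (an assumption made here for determinism).
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    top = max(counts.values())
    for verdict in verdicts:
        if counts[verdict] == top:
            return verdict


def judge_with_voting(sample_verdict: Callable[[], Hashable],
                      n_samples: int = 5) -> Hashable:
    """Query a (hypothetical) stochastic judge n_samples times and vote."""
    verdicts = [sample_verdict() for _ in range(n_samples)]
    return majority_vote(verdicts)
```

In practice `sample_verdict` would call the judge model with a nonzero sampling temperature and parse the final verdict out of its reasoning trace; that parsing step is omitted here.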

Abstract

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

Paper Structure

This paper contains 53 sections, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Conceptual overview of Flex-Judge. We train a multimodal judge model using a small amount of text-only reasoning data. Unlike previous approaches that require modality-specific supervision, Flex-Judge leverages structured text-only rationale behind judgments to enable generalization across modalities. Once trained, Flex-Judge can be applied to various evaluation tasks, including vision-language tasks, audio quality scoring, and molecular structure, without the need for additional task-specific or modality-specific annotations.
  • Figure 2: Comparisons on different perspectives of the seed dataset curation. All evaluations are done with our Flex-Judge, where its backbone model is Qwen2.5-VL-7B (refer to Section \ref{subsec:experimental_setup} for model details) and has been trained on the JudgeLRM-7B response data.
  • Figure 3: Reasoning process of Flex-Judge on the text-image alignment task (GenAI-Bench; li2024genai). Additional qualitative examples are found in Appendix \ref{appx:qualitative_examples}.
  • Figure 4: (Left) Accuracy (%) trends on the parallel artificial membrane permeability assay (PAMPA; velez2024signals) task with different judgment scores. (Middle) Accuracy trends on the number of sampled responses in best-of-$N$ sampling. (Right) Performance comparison with reward-guided Mol-LLaMA. We report accuracy with prompt styles of default, CoT, and task information, as described in kim2025mol.
  • Figure 5: (Left) Performance comparison of Flex-Judge with and without reasoning. (Right) Relationship between the average reasoning length of Flex-VL-7B and the accuracy gain from reasoning over non-reasoning evaluation across subcategories in MLLM-as-a-Judge (Pair w. Tie).
  • ...and 12 more figures