You Only Forward Once: An Efficient Compositional Judging Paradigm

Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang

TL;DR

This work tackles the inefficiency and information loss in multimodal model judging by reframing the task as compositional, template-driven binary judgments. YOFO uses a single forward pass of a decoder-only MLLM to evaluate multiple requirements by reading final-token logits, achieving fast, interpretable judgments with dependency-aware capabilities and optional post-hoc CoT. Empirical results show state-of-the-art performance on recommendation-style tasks, strong cross-domain generalization to fashion data, and significant throughput gains over autoregressive or multi-stage rerankers. The approach opens avenues for fine-grained, instruction-aligned judgments in real-time systems and suggests potential extensions to reinforcement learning signals and multi-label applications.

Abstract

Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis -- where subsequent judgments are conditioned on previous ones -- and further benefits from post-hoc CoT.
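The logit-reading step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `judge_requirements`, the `answer_positions` argument, the `yes_id`/`no_id` token ids, and the 0.5 threshold are all assumptions made for the example. It assumes a single forward pass has already produced a `[sequence_length][vocab_size]` grid of logits, and that the template fixes the position at which each requirement's answer is read.

```python
import math

def judge_requirements(logits, answer_positions, yes_id, no_id, threshold=0.5):
    """Turn per-position logits into one yes/no decision per requirement.

    logits:            nested list, logits[position][token_id], from ONE forward pass
    answer_positions:  hypothetical list of positions, one per requirement
    yes_id, no_id:     hypothetical vocabulary ids of the "yes" and "no" tokens
    """
    results = []
    for pos in answer_positions:
        # A softmax restricted to {yes, no} equals a sigmoid of the logit margin.
        margin = logits[pos][yes_id] - logits[pos][no_id]
        if margin >= 0:
            p_yes = 1.0 / (1.0 + math.exp(-margin))
        else:
            # Numerically stable branch for large negative margins.
            e = math.exp(margin)
            p_yes = e / (1.0 + e)
        results.append((p_yes, p_yes >= threshold))
    return results

# Toy logits: 6 positions, vocabulary of 5 tokens (illustrative values only).
toy_logits = [[0.0] * 5 for _ in range(6)]
toy_logits[2][1] = 3.0   # requirement 1: "yes" logit dominates
toy_logits[5][2] = 2.0   # requirement 2: "no" logit dominates

judgments = judge_requirements(toy_logits, [2, 5], yes_id=1, no_id=2)
```

Because every requirement is read from the same logit grid, the cost is one forward pass regardless of how many requirements the template contains, which is the source of the speedup over autoregressive generation.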

Paper Structure

This paper contains 18 sections, 8 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An example illustrating a limitation of Jina-Reranker-M0. The user seeks a midi or long dress in a non-pink color for an evening event, yet the model ranks a short pink dress higher than a long black dress that is well-suited for such an occasion.
  • Figure 2: Comparisons on Different Information Matching Methods. (a) embeds the inputs and then calculates their similarity. (b) adapts an MLLM to output a single relevance score. (c) reformulates the matching problem as an autoregressive task, outputting a relevance analysis. (d) Our method first decomposes the text into a set of fundamental requirements and prefills them into a predefined template. The MLLM then takes the template and the image as input and, in a single forward pass, outputs whether each requirement is satisfied.
  • Figure 3: Architecture of YOFO. (Left): YOFO uses an MLLM as its backbone. During training, the MLLM receives a structured requirement template as input and is trained to (i) judge each requirement and (ii) produce supporting reasons. YOFO supervises the model to output correct yes/no judgments while applying next-token prediction only to the reasoning text. (Right): At inference time, the user’s query is first decomposed into a structured template, and the MLLM ingests the image and the template to determine, in a single forward pass, whether each requirement is satisfied, without autoregressively generating reasons.
  • Figure 4: Examples of the training and test sets. Each training sample consists of an image, a set of properties, answers, and corresponding reasons. Each test sample consists of two images, a customer-style query, the corresponding Python variables, ground-truth labels, and the expression used to compute each image's final recommendation score.
  • Figure 5: Prompts for dataset construction. (a) We prompt an MLLM to propose ten properties of a randomly selected image, with an equal number of satisfied and unsatisfied properties. (b) We prompt an MLLM to generate a customer-style query for an image pair, indicate whether the first image satisfies the query better than the second, decompose the query into a set of requirements, and produce an expression used to compute each image’s final recommendation score during YOFO evaluation.
  • ...and 2 more figures
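
The Figure 4 and Figure 5 captions mention Python variables and an expression that combines per-requirement judgments into a final recommendation score for each image. A rough sketch of that idea follows, using the dress query from Figure 1. The variable names, the additive expression, and the judgment values are hypothetical placeholders chosen for illustration; the captions do not specify the actual expression format.

```python
def recommendation_score(expression, judgments):
    """Evaluate a scoring expression over named yes/no judgments.

    expression: string over requirement variable names (hypothetical format)
    judgments:  dict mapping each variable name to a bool
    """
    # Builtins are disabled so the expression can only reference the judgments.
    # This is a sketch; eval should not be used on untrusted input.
    return eval(expression, {"__builtins__": {}}, dict(judgments))

# Hypothetical decomposition of: "a midi or long dress, not pink, for an evening event".
expression = "is_midi_or_long + is_not_pink + suits_evening"

# Illustrative judgments for two candidate images (not results from the paper).
long_black_dress = {"is_midi_or_long": True, "is_not_pink": True, "suits_evening": True}
short_pink_dress = {"is_midi_or_long": False, "is_not_pink": False, "suits_evening": True}

score_long_black = recommendation_score(expression, long_black_dress)
score_short_pink = recommendation_score(expression, short_pink_dress)
first_is_better = score_long_black > score_short_pink
```

Scoring each requirement separately keeps the ranking interpretable: the short pink dress loses because two named requirements fail, rather than because of an opaque single relevance score.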