You Only Forward Once: An Efficient Compositional Judging Paradigm
Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang
TL;DR
This work tackles the inefficiency and information loss in multimodal model judging by reframing the task as compositional, template-driven binary judgments. YOFO uses a single forward pass of a decoder-only MLLM to evaluate multiple requirements, reading the logits at the final token of each requirement to produce fast, interpretable judgments, with support for dependency-aware judging and optional post-hoc chain-of-thought (CoT). Empirical results show state-of-the-art performance on recommendation-style tasks, strong cross-domain generalization to fashion data, and significant throughput gains over autoregressive or multi-stage rerankers. The approach opens avenues for fine-grained, instruction-aligned judgments in real-time systems and suggests potential extensions to reinforcement learning signals and multi-label applications.
Abstract
Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with their generative nature and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis, in which subsequent judgments are conditioned on previous ones, and further benefits from post-hoc chain-of-thought (CoT) reasoning.
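The per-requirement readout described above can be illustrated with a minimal sketch. It assumes the model has already produced a logit vector at every sequence position in one forward pass, and that each requirement's decision is taken by comparing the logits of a "yes" token and a "no" token at that requirement's final position. The function name `judge_requirements`, the token ids, the threshold, and the toy logits are all hypothetical and chosen for illustration; the abstract does not specify these details.

```python
import math


def judge_requirements(logits, requirement_end_positions, yes_id, no_id, threshold=0.5):
    """Read one yes/no decision per requirement from a single forward pass.

    logits: per-position logit vectors (seq_len x vocab_size), as nested lists.
    requirement_end_positions: index of the final token of each requirement
        (an assumed convention for this sketch).
    Returns a list of (p_yes, decision) tuples, one per requirement.
    """
    results = []
    for pos in requirement_end_positions:
        yes_logit = logits[pos][yes_id]
        no_logit = logits[pos][no_id]
        # A two-way softmax over {yes, no} equals a sigmoid of the logit difference.
        p_yes = 1.0 / (1.0 + math.exp(no_logit - yes_logit))
        results.append((p_yes, p_yes >= threshold))
    return results


# Toy logits for a 4-token sequence over a 3-token vocabulary (made-up values).
toy_logits = [
    [0.0, 0.0, 0.0],
    [0.1, 2.0, -1.0],  # final token of requirement 1
    [0.0, 0.0, 0.0],
    [0.3, -0.5, 1.5],  # final token of requirement 2
]
decisions = judge_requirements(toy_logits, [1, 3], yes_id=1, no_id=2)
```

Because all requirement positions are read from the same set of logits, the cost is one forward pass regardless of how many requirements the template contains, which is the source of the speedup over generating an analysis token by token.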
