Multi-Objective Task-Aware Predictor for Image-Text Alignment

Eunki Kim, Na Min An, James Thorne, Hyunjung Shim

TL;DR

This work addresses how to evaluate image-text alignment so that scores match human judgments across multiple objectives and long input contexts. It introduces MULTI-TAP, a backbone-agnostic scalar reward predictor built on large vision-language models (LVLMs) that first learns a single overall alignment score and then derives multi-objective scores via ridge regression on frozen embeddings. The authors also present EYE4ALL, a text-image-to-text (TI2T) dataset focused on blind and low-vision (BLV) users, with pairwise preferences and fine-grained scores across seven dimensions, for benchmarking and training human-aligned evaluation tools. Empirical results show that MULTI-TAP correlates more strongly with human judgments than existing metrics, transfers across diverse LVLM architectures, maintains strong performance with long contexts, and runs inference more efficiently than generative models and prior multi-objective predictors. Together, these contributions provide a robust predictor and a benchmark for practical, interpretable, and accessible multimodal evaluation.

Abstract

Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that each lack at least one of these key properties: (1) alignment with human judgments, (2) long-sequence processing, (3) inference efficiency, and (4) applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi- and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even performing on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP surpasses VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
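
As a rough illustration of the ridge regression step mentioned in the abstract, the sketch below fits a closed-form ridge regression from fixed embedding vectors to several objective scores. It is a minimal, dependency-free sketch under stated assumptions, not the authors' implementation: the function names `ridge_fit` and `ridge_predict`, the embedding size, the regularization strength `lam`, and the synthetic data are all hypothetical, and the intercept term is omitted for brevity.

```python
import random


def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y.

    X: n x d list of fixed (frozen) embeddings, one row per example.
    Y: n x k list of scores, one column per objective.
    Returns W as a d x k list of weights. No intercept is fitted.
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    # A = X^T X + lam * I  (d x d), B = X^T Y  (d x k)
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    B = [[sum(X[i][a] * Y[i][c] for i in range(n)) for c in range(k)]
         for a in range(d)]
    # Gauss-Jordan elimination with partial pivoting on the augmented [A | B].
    M = [A[r] + B[r] for r in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(d):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[d:] for row in M]


def ridge_predict(X, W):
    """Map each embedding to k objective scores with the fitted weights."""
    return [[sum(x[a] * W[a][c] for a in range(len(x)))
             for c in range(len(W[0]))] for x in X]


# Synthetic stand-ins (assumptions): 4-dim "embeddings", 7 objectives.
random.seed(0)
d, k, n = 4, 7, 50
W_true = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Y = ridge_predict(X, W_true)           # noiseless targets for the demo

W = ridge_fit(X, Y, lam=1e-6)          # only W is learned; X stays fixed
scores = ridge_predict(X[:1], W)[0]    # seven scores for one example
```

Because only the small weight matrix is fitted while the embeddings stay fixed, this step is cheap compared with fine-tuning the backbone. In practice the embeddings would come from the LVLM and a numerical library would handle the solve; the pure-Python solver here only keeps the example self-contained.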

Paper Structure

This paper contains 33 sections, 10 figures, 21 tables.

Figures (10)

  • Figure 1: Comparison between existing image-text alignment metrics and ours. Our proposed predictor, applicable to different types of LVLMs, overcomes the challenges of conventional metrics by (1) showing high correlations with human judgments, (2) understanding long input text sequences with detailed instructions, (3) reducing inference time by returning precise scalar scores, and (4) enabling interpretable embeddings that can be disentangled into multi-objective scores.
  • Figure 2: Schematic diagram of the proposed MULTI-TAP architecture. In Stage 1, MULTI-TAP produces a scalar value reflecting image-text alignment by appending a reward head to the LVLM. In Stage 2, a ridge regression layer is added on top of the trained multimodal embeddings to generate scores across multiple aspects.
  • Figure 3: Performance of VisionREW-S (red) and our MULTI-TAP (blue) on multi-objective datasets. Our MULTI-TAPQwen-7B-S generally outperforms VisionREW-S (19B), achieving accuracies that are 34%p, 3%p, and 53%p higher on the VisionREW, EYE4ALLMulti-Binary, and Align-anything (T2I-Binary) datasets, respectively.
  • Figure 4: Sample screenshot of the interface used in the human experiment. Each annotator sees this annotation screen 100 times, each time with a different image-request-response combination.
  • Figure 5: Example of EYE4ALLMulti. EYE4ALLMulti comprises a text request, an image, model-generated responses, and scores across seven dimensions: Conciseness, Sufficiency, Safeness, Hallucination, Direction Accuracy, Depth Accuracy, and Overall Quality. These scores are normalized and averaged over 2-3 human annotators.
  • ...and 5 more figures