TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs

Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao

TL;DR

TokenFocus-VQA tackles fine-grained text-to-image alignment evaluation by moving beyond global similarity metrics to token-level, position-aware supervision within a VQA framework driven by large vision-language models (LVLMs). By focusing the loss on the first generated token corresponding to crucial semantic elements and mapping that token's probability distribution to a numeric score via an expected-value calculation, the method measures alignment at a finer granularity. The framework further improves robustness through prompts augmented with external structural information and a hierarchical ensemble (bagging, stacking, blending) that aggregates multi-perspective assessments across diverse LVLM architectures. Empirical results on NTIRE 2025 Track 1 and EvalMuse-40K demonstrate state-of-the-art or competitive performance, with substantial gains in both holistic alignment metrics (SRCC, PLCC) and element-level accuracy, highlighting the method's practical value for assessing synthesized-image quality and refining generation models. The work lays groundwork for more expressive, token-aware evaluation pipelines and suggests future directions in dynamic vocabulary adaptation and deeper cross-modal reasoning components.
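The expected-value mapping mentioned above can be illustrated with a minimal sketch. It assumes the score is read from the logits of the first generated token, that the candidate score tokens are the strings "1" to "5", and that the softmax is restricted to those candidates; the function name `expected_score` and the example logits are hypothetical and not taken from the paper.

```python
import math


def expected_score(first_token_logits, score_values):
    """Map first-token logits to a continuous score via an expectation.

    first_token_logits: dict from token string to logit (hypothetical input).
    score_values: dict from candidate token string to its numeric value.
    Assumption: the softmax is taken over the candidate score tokens only.
    """
    tokens = list(score_values)
    # Subtract the maximum logit for numerical stability.
    peak = max(first_token_logits[t] for t in tokens)
    weights = {t: math.exp(first_token_logits[t] - peak) for t in tokens}
    normalizer = sum(weights.values())
    # Expected value: sum of value * probability over the candidates.
    return sum(score_values[t] * weights[t] / normalizer for t in tokens)


# Illustrative call: logits that favor "4", with some mass on "3" and "5".
values = {"1": 1.0, "2": 2.0, "3": 3.0, "4": 4.0, "5": 5.0}
logits = {"1": -2.0, "2": -1.0, "3": 1.0, "4": 2.5, "5": 0.5}
score = expected_score(logits, values)
```

Compared with taking the single most likely token, the expectation yields a continuous score, which suits rank- and correlation-based metrics such as SRCC and PLCC.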

Abstract

While text-to-image (T2I) generation models have achieved remarkable progress in recent years, existing evaluation methodologies for vision-language alignment still struggle with fine-grained semantic matching. Current approaches based on global similarity metrics often overlook critical token-level correspondences between textual descriptions and visual content. To this end, we present TokenFocus-VQA, a novel evaluation framework that leverages Large Vision-Language Models (LVLMs) through a visual question answering (VQA) paradigm with position-specific probability optimization. Our key innovation lies in designing a token-aware loss function that selectively focuses on probability distributions at pre-defined vocabulary positions corresponding to crucial semantic elements, enabling precise measurement of fine-grained semantic alignment. The proposed framework further integrates ensemble learning techniques to aggregate multi-perspective assessments from diverse LVLM architectures, yielding additional performance gains. Evaluated on the NTIRE 2025 T2I Quality Assessment Challenge Track 1, our TokenFocus-VQA ranks 2nd (0.8445, only 0.0001 below the 1st-place method) on the public evaluation and 2nd (0.8426) on the official private test set, demonstrating its superiority over conventional evaluation methods in capturing nuanced text-image correspondences.
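One possible reading of the token-aware loss is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: the loss is a hard-label cross-entropy, it is computed at a single answer position only, and the softmax is restricted to a pre-defined set of candidate vocabulary ids. The function name `token_focused_loss` and all ids and logits are hypothetical.

```python
import math


def token_focused_loss(step_logits, focus_position, candidate_ids, target_id):
    """Cross-entropy at one answer position over selected vocabulary ids.

    step_logits: list of per-position logit lists (position -> vocabulary).
    focus_position: index of the answer token that receives supervision.
    candidate_ids: vocabulary ids of the score tokens (assumed pre-defined).
    target_id: vocabulary id of the ground-truth score token.
    """
    if target_id not in candidate_ids:
        raise ValueError("target_id must be one of candidate_ids")
    row = step_logits[focus_position]
    picked = [row[i] for i in candidate_ids]
    # Log-sum-exp over the candidates, stabilized by the maximum logit.
    peak = max(picked)
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in picked))
    # Negative log-probability of the target among the candidates.
    return log_normalizer - row[target_id]


# Toy vocabulary of four ids; ids 0..2 stand in for score tokens.
toy_logits = [
    [0.0, 0.0, 0.0, 9.0],   # focus position: id 3 is ignored by the loss
    [5.0, -5.0, 1.0, 2.0],  # later position: contributes nothing
]
loss = token_focused_loss(toy_logits, focus_position=0,
                          candidate_ids=[0, 1, 2], target_id=1)
```

In this sketch, positions other than `focus_position` and vocabulary entries outside `candidate_ids` contribute nothing to the loss, so the training signal concentrates on the token that carries the score.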

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: An example from EvalMuse-40K as used in the NTIRE 2025 Challenge. Different element types (object elements, action elements, and item attributes) are marked with distinct colors. The total score ranges from 1 to 5, and each element-level score is either 0 or 1. The values shown in the tables are averaged over three or six annotators.
  • Figure 2: The overall framework of the proposed TokenFocus-VQA for LVLM-based T2I alignment assessment at both the holistic and fine-grained levels. Visual encoding begins by transforming input images into visual tokens via a vision encoder. For each scoring task (i.e., Total Score and Element Score), we construct task-specific input prompts augmented with structured metadata. These multimodal tokens are then jointly processed by the large language decoder (i.e., InternLM [cai2024internlm2technicalreport] and Qwen2.5 [qwen2025qwen25technicalreport]) for generative score prediction. The framework is finally refined with our proposed Position-Aware Token-Focused Optimization method for further performance gains.
  • Figure 3: The overall illustration of our ensemble training and inference workflow.