TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao
TL;DR
TokenFocus-VQA addresses fine-grained text-to-image alignment evaluation by moving beyond global similarity metrics to token-level, position-aware supervision within a VQA framework driven by large vision-language models (LVLMs). The method concentrates the loss on the first generated token corresponding to crucial semantic elements and maps that token's probability distribution to a numeric score via an expected-value calculation, which yields finer-grained alignment measurements. The framework further improves robustness through prompting with external structural information and a hierarchical ensemble (bagging, stacking, blending) that aggregates multi-perspective assessments across diverse LVLM architectures. Empirical results on NTIRE 2025 Track 1 and EvalMuse-40K show state-of-the-art or competitive performance, with substantial gains in both holistic alignment metrics (SRCC, PLCC) and element-level accuracy, highlighting the practical value for synthesis quality assessment and model refinement. The work lays the groundwork for more expressive, token-aware evaluation pipelines and suggests future directions in dynamic vocabulary adaptation and deeper cross-modal reasoning components.
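The expected-value mapping mentioned above can be illustrated with a minimal sketch. It assumes the model's first generated token is restricted to a small set of candidate score tokens, each tied to a numeric value; the function name `expected_score` and the 1-to-5 scale in the usage note are illustrative assumptions, not details taken from the paper.

```python
import math


def expected_score(logits, score_values):
    """Turn logits over candidate score tokens into one scalar score.

    logits       : raw logits of the first generated token, one per candidate
                   score token (hypothetical input layout).
    score_values : numeric value assigned to each candidate token.
    """
    if len(logits) != len(score_values):
        raise ValueError("logits and score_values must have the same length")

    # Numerically stable softmax over the candidate tokens only.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Expected value of the score under the token distribution.
    return sum(p * v for p, v in zip(probs, score_values))
```

For example, calling `expected_score([0.0] * 5, [1, 2, 3, 4, 5])` with uniform logits returns the midpoint of the scale, while a distribution sharply peaked on one token returns approximately that token's value. The result is a continuous score, not a single hard label.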
Abstract
While text-to-image (T2I) generation models have achieved remarkable progress in recent years, existing evaluation methodologies for vision-language alignment still struggle with fine-grained semantic matching. Current approaches based on global similarity metrics often overlook critical token-level correspondences between textual descriptions and visual content. To this end, we present TokenFocus-VQA, a novel evaluation framework that leverages Large Vision-Language Models (LVLMs) through a visual question answering (VQA) paradigm with position-specific probability optimization. Our key innovation lies in designing a token-aware loss function that selectively focuses on probability distributions at pre-defined vocabulary positions corresponding to crucial semantic elements, enabling precise measurement of fine-grained semantic alignment. The proposed framework further integrates ensemble learning techniques to aggregate multi-perspective assessments from diverse LVLM architectures, thereby achieving further performance gains. Evaluated on the NTIRE 2025 T2I Quality Assessment Challenge Track 1, our TokenFocus-VQA ranks 2nd (0.8445, only 0.0001 below the top-ranked method) on the public evaluation and 2nd (0.8426) on the official private test set, demonstrating superiority in capturing nuanced text-image correspondences compared to conventional evaluation methods.
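One plausible reading of the token-aware loss described above is a cross-entropy computed only at the first answer position and only over the pre-defined vocabulary entries. The sketch below follows that reading; renormalizing over the candidate entries, the function name `focused_token_loss`, and the argument layout are all assumptions made for illustration and may differ from the authors' formulation.

```python
import math


def focused_token_loss(first_token_logits, candidate_ids, target_index):
    """Cross-entropy restricted to pre-defined vocabulary positions.

    first_token_logits : full-vocabulary logits at the first generated token.
    candidate_ids      : vocabulary indices of the answer tokens of interest
                         (hypothetical, e.g. the ids of score tokens).
    target_index       : position within candidate_ids of the correct token.
    """
    # Keep only the logits at the pre-defined vocabulary positions.
    selected = [first_token_logits[i] for i in candidate_ids]

    # Stable log-sum-exp over the selected logits.
    max_logit = max(selected)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in selected))

    # Negative log-probability of the target under the restricted softmax.
    return log_norm - selected[target_index]
```

Under this assumption, logits at all other vocabulary entries and all later token positions contribute nothing to the loss, so the training signal concentrates on the token that carries the alignment judgment.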
