Towards Flexible Evaluation for Generative Visual Question Answering
Huishan Ji, Qingyi Si, Zheng Lin, Weiping Wang
TL;DR
The paper tackles the challenge of fairly evaluating open-ended visual question answering by introducing semantics-based evaluators. It proposes three quantitative properties (Alignment, Consistency, and Generalization) and the AVE dataset to analyze evaluator behavior, then introduces SFVE, a contrastive-learning-based evaluator trained with carefully designed pretraining tasks. Empirical results show SFVE significantly surpasses traditional formulaic metrics, standard embedding models, and even some LLM-based evaluations, and the training scheme generalizes to both encoder-only and decoder-only architectures. The work enables flexible, human-aligned assessment of multimodal models, facilitating fair comparisons and practical deployment in VQA research and development.
Abstract
Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient has been fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a well-developed test field, limitations of VQA evaluation, such as the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. Because the characteristics of VQA make such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, we propose three key properties, i.e., Alignment, Consistency, and Generalization, together with a corresponding dataset, Assessing VQA Evaluators (AVE), to systematically analyze the behaviour and compare the performance of various evaluators, including LLM-based ones. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE), designed around the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs.
