Towards Flexible Evaluation for Generative Visual Question Answering

Huishan Ji; Qingyi Si; Zheng Lin; Weiping Wang

TL;DR

The paper tackles the challenge of fairly evaluating open-ended visual question answering by introducing semantics-based evaluators. It proposes three quantitative properties (Alignment, Consistency, and Generalization) and the AVE dataset to analyze evaluator behavior, then introduces SFVE, a contrastive learning-based evaluator trained with pretraining tasks designed for VQA evaluation. Empirical results show SFVE significantly surpasses traditional formulaic metrics, standard embedding models, and even some LLM-based evaluations, and its training scheme generalizes to both encoder-only and decoder-only architectures. The work enables flexible, human-aligned assessment of multimodal models, facilitating fair comparisons and practical deployment in VQA research and development.

Abstract

Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a well-developed test field, limitations of VQA evaluation, like the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. As the characteristics of VQA make such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, we propose three key properties, i.e., Alignment, Consistency and Generalization, and a corresponding dataset, Assessing VQA Evaluators (AVE), to systematically analyze the behaviour and compare the performance of various evaluators, including LLM-based ones. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE), designed around the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs.
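To make the Exact Match limitation concrete, the following is a minimal sketch of a normalized exact-match check. The question, annotated answer, and responses are invented for illustration and do not come from the paper or from AVE; the `normalize` and `exact_match` helpers are likewise hypothetical, not the official VQA metric implementation.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(response: str, answer: str) -> bool:
    """True only if the normalized response equals the normalized answer."""
    return normalize(response) == normalize(answer)


# Hypothetical annotated answer and model responses (illustrative only).
answer = "umbrella"
responses = [
    "Umbrella.",
    "The woman is holding an umbrella.",
    "She holds a red umbrella to keep off the rain.",
]

for response in responses:
    print(f"{exact_match(response, answer)!s:5}  {response}")
```

Only the one-word response passes, although all three are correct answers to the question. This is the gap that a semantics-based evaluator is meant to close.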

Paper Structure (53 sections, 7 equations, 5 figures, 3 tables)

Introduction
Related Work
Visual Question Answering
Semantic Textual Similarity
Multimodal Comprehension Evaluation of MLLMs
Semantic Evaluation of VQA
Characteristics of VQA Evaluation
Discrimination Granularity
Text Length
Distribution Shift
Three Key Properties in VQA Evaluation
Alignment
Consistency
Generalization
A Dataset Assessing VQA Evaluators
...and 38 more sections

Figures (5)

Figure 1: Responses from four MLLMs on a simple visual question. The responses differ in length, style and complexity; all of them can be considered correct, but none exactly matches the annotated answer.
Figure 2: The construction procedure of AVE. After being randomly sampled from the outputs of models, each sample is manually annotated with a score and automatically augmented with generated descriptions and a variation on the answer word, while retaining almost the same correctness as a VQA response. Parts 1 to 3 denote different augmentation methods.
Figure 3: Framework of contrastive learning in the proposed Semantically Flexible VQA Evaluator (SFVE). The original sample is augmented into two variations to form a positive pair and a negative pair. The example in the figure shows the procedure of the pretraining task "Generated descriptions". In the positive pair, the semantics of the sentence are considered the same as the original, while in the negative pair, because the answer word is replaced with a random answer, the meaning of the sentence no longer matches the original.
Figure 4: Cases for analysis. The samples come from the open-ended part of the A-OKVQA (Schwenk et al., 2022) validation set. The first row shows results from SFVE-large and SBERT, and the second from SFVE-large and BGE.
Figure 5: The distribution of annotated scores in AVE.
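
The pair construction described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the fixed sentence template stands in for a generated description, `build_pairs` is a hypothetical helper, and the cosine-based triplet margin loss is a common contrastive objective swapped in for illustration, since the caption does not specify the loss used by SFVE.

```python
import math
import random


def build_pairs(question, answer, answer_pool, rng):
    """Return (original, positive, negative) texts for one VQA sample."""
    original = answer
    # Positive: a longer sentence that keeps the answer word.
    # A fixed template stands in for a generated description (assumption).
    positive = f"The answer to '{question}' is {answer}."
    # Negative: the same sentence with the answer replaced by a random one.
    wrong = rng.choice([a for a in answer_pool if a != answer])
    negative = f"The answer to '{question}' is {wrong}."
    return original, positive, negative


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_loss(anchor, pos, neg, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, margin - cosine(anchor, pos) + cosine(anchor, neg))


rng = random.Random(0)
pool = ["umbrella", "kite", "phone"]  # hypothetical answer vocabulary
print(build_pairs("What is the man holding?", "umbrella", pool, rng))
```

In a real training loop the three texts would be embedded by the evaluator's encoder, and the loss would be computed on those embeddings; the toy vectors passed to `triplet_loss` in any test are placeholders for them.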
