HLTCOE Evaluation Team at TREC 2025: VQA Track
Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme
TL;DR
The paper addresses the problem of generating multiple plausible answers for a video-question pair and ranking them stably in the VQA setting. It introduces a two-stage generate-and-rerank framework: a base multimodal generator produces diverse candidate answers via prompt variation, and a reranker trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights learns listwise rankings over those candidates. Key contributions include the Rank-Labeled Candidate Representation, the Masked Pointer Cross-Entropy objective, and experiments showing improved accuracy and ranking stability, especially for questions that require temporal reasoning or semantic disambiguation. The approach offers a generalizable way to combine generative modeling with discriminative ranking in multimodal tasks, and the results point to practical benefits from careful prompt design and rank-aware supervision.
Abstract
The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
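To make the three ingredients of the objective concrete, the following is a minimal sketch of one possible reading of a masked pointer cross-entropy with rank weights. It is not the authors' implementation. The function name `masked_pointer_ce`, the argument layout, the choice of one "pointer" token per candidate, and the normalization by the sum of the weights are all assumptions made for illustration. At each rank position, the softmax is restricted to the pointer tokens (the vocabulary restriction), the gold candidate for that position is the target (pointer-based selection), and each position's loss is scaled by a per-rank weight.

```python
import math

def masked_pointer_ce(logits, pointer_ids, target_order, rank_weights):
    """Hypothetical sketch, not the paper's exact loss.

    logits:       T lists of V floats, one per rank position.
    pointer_ids:  vocabulary ids that stand for the K candidates (assumed).
    target_order: for each position t, the index into pointer_ids of the
                  candidate that should occupy rank t.
    rank_weights: T non-negative weights, one per rank position (assumed).
    """
    total = 0.0
    for step_logits, gold, weight in zip(logits, target_order, rank_weights):
        # Vocabulary restriction: only pointer tokens compete in the softmax;
        # every other vocabulary entry is masked out.
        restricted = [step_logits[i] for i in pointer_ids]
        # Numerically stable log-sum-exp over the restricted logits.
        peak = max(restricted)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in restricted))
        # Cross-entropy for the gold candidate, scaled by the rank weight.
        total += weight * (log_z - restricted[gold])
    # Normalize so the loss scale does not depend on the weight magnitudes.
    return total / sum(rank_weights)

# Two rank positions, a 5-token vocabulary, three pointer tokens (ids 1, 2, 3).
example_logits = [[0.0, 2.0, 0.5, 0.1, 9.0], [0.0, 0.3, 1.5, 0.2, 9.0]]
example_loss = masked_pointer_ce(example_logits, [1, 2, 3], [0, 1], [1.0, 0.5])
```

In this sketch the large logit at vocabulary id 4 has no effect on the loss because that id is not a pointer token, which is the intended effect of the masking. Giving earlier positions larger weights, as in the example, penalizes mistakes near the top of the list more heavily.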
