HLTCOE Evaluation Team at TREC 2025: VQA Track

Dengjia Zhang; Charles Weng; Katherine Guerrerio; Yi Lu; Kenton Murray; Alexander Martin; Reno Kriz; Benjamin Van Durme

TL;DR

The paper addresses the problem of generating multiple plausible answers for a video-question pair and ranking them stably in the VQA setting. It introduces a two-stage generate-and-rerank framework: a base multimodal generator produces diverse candidate answers through prompt variation, and a reranker trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights learns a listwise ranking over them. Key contributions are the Rank-Labeled Candidate Representation, the Masked Pointer Cross-Entropy objective, and experiments showing improved accuracy and ranking stability, especially for questions that require temporal reasoning or are semantically ambiguous. The approach offers a general way to combine generative modeling with discriminative ranking in multimodal tasks and illustrates the practical benefit of careful prompt design and rank-aware supervision.

Abstract

The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
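The abstract names three ingredients of the reranking objective: pointer-based candidate selection, rank-dependent weighting, and a cross-entropy masked to a restricted vocabulary. The sketch below shows one way these could fit together; it is an illustration under stated assumptions, not the authors' formulation. In particular, the function name `masked_pointer_ce`, the idea that each candidate is addressed by a single label token, and the default `1 / log2(t + 2)` weights are all hypothetical choices made for this example.

```python
import math

def masked_pointer_ce(logits, label_ids, gold_order, rank_weights=None):
    """Rank-weighted cross-entropy restricted to candidate-label tokens.

    logits:       one list of vocabulary scores per output rank position.
    label_ids:    vocabulary ids of the tokens that point at each candidate
                  (assumption: one label token per candidate).
    gold_order:   for each rank position, the index into label_ids of the
                  candidate that should appear at that rank.
    rank_weights: per-position weights; the default (assumed, not from the
                  paper) emphasizes the top ranks.
    """
    if rank_weights is None:
        rank_weights = [1.0 / math.log2(t + 2) for t in range(len(gold_order))]

    total, norm = 0.0, 0.0
    for t, gold in enumerate(gold_order):
        # Masking: only candidate-label tokens take part in the softmax,
        # so scores of all other vocabulary entries are ignored.
        restricted = [logits[t][v] for v in label_ids]
        m = max(restricted)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in restricted))
        nll = log_z - restricted[gold]  # pointer target at rank t
        total += rank_weights[t] * nll
        norm += rank_weights[t]
    return total / norm  # weighted mean over rank positions
```

With three candidates and uniform scores, every position contributes a loss of log 3 regardless of the weights, and a large score on a token outside `label_ids` leaves the loss unchanged, which is the effect the vocabulary restriction is meant to have in this sketch.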
