HLTCOE Evaluation Team at TREC 2025: VQA Track

Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme

TL;DR

The paper addresses the problem of generating multiple plausible answers to a video-question pair and ranking them stably in the VQA setting. It introduces a two-stage generate-and-rerank framework: a base multimodal generator produces diverse candidate answers through prompt variation, and a reranker trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights learns a listwise ranking over them. Key contributions include the Rank-Labeled Candidate Representation, the Masked Pointer Cross-Entropy objective, and a set of experiments showing improved accuracy and ranking stability, especially for questions that require temporal reasoning or semantic disambiguation. The work offers a generalizable way to combine generative modeling with discriminative ranking in multimodal tasks and shows the practical benefit of careful prompt design and rank-aware supervision.
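The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for the multimodal generator and the trained reranker, the video input is omitted for brevity, and dropping duplicate candidates is an assumption added here.

```python
from typing import Callable, List, Optional, Sequence


def generate_and_rerank(
    question: str,
    prompts: Sequence[str],
    generate: Callable[[str], str],      # hypothetical generator: prompt -> answer
    score: Callable[[str, str], float],  # hypothetical reranker: (question, answer) -> relevance
    top_k: Optional[int] = None,
) -> List[str]:
    """Stage 1: one candidate per prompt variant. Stage 2: order candidates by score."""
    candidates: List[str] = []
    for template in prompts:
        answer = generate(template.format(question=question)).strip()
        # Assumption: identical candidates are kept only once.
        if answer and answer not in candidates:
            candidates.append(answer)
    # Highest predicted relevance first.
    ranked = sorted(candidates, key=lambda a: score(question, a), reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```

Each prompt template is expected to contain a `{question}` placeholder; in the real system the generator would also receive the video.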

Abstract

The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
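The abstract names three ingredients of the objective: pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction. The exact formulation is not given in this section, so the sketch below is only one plausible reading. It assumes that each rank position predicts one candidate-label token, that the softmax is restricted to the candidate-label token ids, and that rank weights are reciprocal ranks; all function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def reciprocal_rank_weights(n: int) -> List[float]:
    """Assumed weighting: position r (0-based) gets weight 1 / (r + 1)."""
    return [1.0 / (r + 1) for r in range(n)]


def masked_pointer_ce(
    step_logits: Sequence[Sequence[float]],  # one full-vocabulary logit vector per rank position
    target_ids: Sequence[int],               # gold candidate-label token id at each position
    candidate_ids: Sequence[int],            # token ids allowed to compete (the "pointers")
    rank_weights: Sequence[float],           # one weight per rank position
) -> float:
    """Weighted average of cross-entropy terms computed over candidate tokens only."""
    total, weight_sum = 0.0, 0.0
    for logits, target, w in zip(step_logits, target_ids, rank_weights):
        if target not in candidate_ids:
            raise ValueError("target must be one of the candidate label ids")
        # Vocabulary restriction: tokens outside candidate_ids are masked out.
        allowed = [logits[i] for i in candidate_ids]
        m = max(allowed)
        log_z = m + math.log(sum(math.exp(x - m) for x in allowed))
        total += w * (log_z - logits[target])  # negative log-likelihood of the gold pointer
        weight_sum += w
    return total / weight_sum
```

Under these assumptions, logits on non-candidate tokens have no effect on the loss, and errors at the top of the list are penalized more heavily than errors further down.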

Paper Structure

This paper contains 15 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of the proposed generate–rank framework. The framework comprises two key components: a Generator, which produces a set of candidate responses given an input prompt, and a Reranker, which assigns relevance scores to these candidates and orders them accordingly. (a) Inference Pipeline: The generator outputs multiple candidate responses, which are subsequently reordered by the reranker based on their predicted relevance to the query. (b) Reranker Training: The reranker is trained using ground-truth rankings, where token-level weighting is incorporated into the loss function to emphasize more informative tokens. Darker colors denote higher token weights, indicating a greater contribution to the overall loss.
  • Figure 2: Prompt1
  • Figure 3: Prompt2
  • Figure 4: Prompt3
  • Figure 5: Prompt4
  • ...and 3 more figures