Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track

Deepak Gupta, Dina Demner-Fushman

TL;DR

MedVidQA 2024 introduces two medical video QA tasks—VCVAL for video retrieval and visual-answer localization, and QFISC for query-focused instructional step captioning—to advance multimodal understanding and generation from medical videos. It leverages a substantial medical instructional-video corpus ($48{,}605$ videos) and HIREST-based data to train and evaluate across retrieval, localization, and captioning, using a mix of automatic metrics ($MAP$, $nDCG$, IoU-based measures) and generation/semantic metrics (BLEU, ROUGE, METEOR, SPICE, BERTScore) plus human judgments. Results across seven teams reveal competitive performance (e.g., best $nDCG$ in VR and best $IoU$ in VAL) and highlight the benefits and trade-offs of different evaluation choices and thresholds. The work provides ground-truth judgments, baselines (e.g., BM25), and a reproducible framework to foster research in medical language–video understanding and instructional content generation with practical implications for education and clinical decision support.

Abstract

One of the key goals of artificial intelligence (AI) is the development of a multimodal system that facilitates communication with the visual world (image and video) using a natural language query. Earlier work on medical question answering primarily focused on textual and visual (image) modalities, which may be inefficient for answering questions that require demonstration. In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning, visual question answering, and natural language video localization. Most existing work on language and vision has focused on creating datasets and developing solutions for open-domain applications. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. With increasing interest in AI to support clinical decision-making and improve patient engagement, there is a need to explore such challenges and develop efficient algorithms for medical language-video understanding and generation. To this end, we introduced new tasks to foster research on designing systems that can understand medical videos to provide visual answers to natural language questions and that are equipped with the multimodal capability to generate instruction steps from a medical video. These tasks have the potential to support the development of sophisticated downstream applications that can benefit the public and medical professionals.

Paper Structure

This paper contains 19 sections, 2 figures, 7 tables, 2 algorithms.

Figures (2)

  • Figure 3: Annotation guidelines for manual step captioning in the QFISC task.
  • Figure 4: Human assessment guidelines for evaluating the system-generated steps in the QFISC task.