PolySmart @ TRECVid 2024 Medical Video Question Answering
Jiaxin Wu, Yiyang Jiang, Xiao-Yong Wei, Qing Li
TL;DR
This work tackles medical video question answering through two tasks: Video Corpus Visual Answer Localization (VCVAL) and Query-Focused Instructional Step Captioning (QFISC). For VCVAL, it introduces a two-stage pipeline. The first stage is text-to-text retrieval over video transcripts, augmented by GPT-4-generated QA content. The second stage is a multi-modal localization framework that combines I3D visual features, DeBERTa textual representations, cross-modal fusion, and adaptive knowledge transfer via a lookup table and an IoU-guided loss. For QFISC, it uses LLaVA-Next-Video (32B) to generate initial video captions and GPT-4 to turn them into time-aligned, structured instructional step captions. Experimental results show competitive MAP scores for VCVAL (e.g., 0.1401), and the single GPT-4-based QFISC run obtains an F-score of 11.92 and a mean IoU of 9.6527. These results demonstrate the effectiveness of multimodal fusion in medical video QA, while limitations remain in boundary precision and terminology handling.
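Both the IoU-guided loss and the mean IoU metric mentioned above rest on the temporal intersection-over-union between a predicted answer span and a reference span. The following is a minimal sketch of that quantity only, not the authors' loss or the official evaluation script; the function name `temporal_iou` and the (start, end) tuple convention in seconds are assumptions made for illustration.

```python
def temporal_iou(pred, gold):
    """Temporal IoU of two (start, end) spans, returned as a fraction in [0, 1].

    Hypothetical helper: spans are assumed to be in seconds with start <= end.
    """
    p_start, p_end = pred
    g_start, g_end = gold
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (pred, gold) span pairs; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)
```

For instance, a prediction of (10, 30) against a reference of (20, 40) overlaps for 10 seconds out of a 30-second union, so its IoU is one third.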
Abstract
Video Corpus Visual Answer Localization (VCVAL) includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question, based on the similarity between video transcripts and answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning the query with both the visual content and the subtitles. For the Query-Focused Instructional Step Captioning (QFISC) task, the step captions are generated by GPT-4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submit only one run for evaluation, which obtains an F-score of 11.92 and a mean IoU of 9.6527.
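The retrieval step described above, ranking videos by how similar their transcripts are to a generated answer, can be sketched as follows. The abstract does not say which similarity model is used, so this sketch swaps in a plain bag-of-words cosine similarity as a stand-in; the function names, the tokenizer, and the sample transcripts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(generated_answer, transcripts):
    """Rank video ids by transcript similarity to a generated answer.

    `generated_answer` stands for the answer text produced by a language
    model for the medical question; `transcripts` maps video id -> transcript.
    Returns (video_id, score) pairs, most similar first.
    """
    query_vec = Counter(tokenize(generated_answer))
    scored = [
        (video_id, cosine(query_vec, Counter(tokenize(text))))
        for video_id, text in transcripts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Usage with made-up data.
transcripts = {
    "v1": "apply pressure to the wound and wrap the bandage",
    "v2": "stretch the hamstring slowly",
}
ranking = rank_videos("Wrap a bandage around the wound", transcripts)
```

In practice a dense sentence encoder or a lexical ranker such as BM25 would likely replace the raw term counts, while the ranking logic stays the same.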
