CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

Vsevolod Kovalev, Parteek Kumar

TL;DR

This work tackles timestamped question answering for lecture videos under a single-GPU latency budget by introducing the CourseTimeQA benchmark and a cross-modal retriever, CrossFusion-RAG. The approach uses frozen encoders with a learnable visual projection, shallow cross-attention, a temporal-consistency loss, and a small reranker; it improves nDCG@10 by 0.10 and MRR by 0.08 over a BLIP-2 retriever at a median end-to-end latency of about 1.55 s on a single A100. The paper evaluates against multiple baselines under matched hardware and indexing, analyzes robustness to ASR noise, and documents training, tuning, and reproducibility details to enable fair comparisons. In practice, the method targets faster re-watching, just-in-time concept reviews, and scalable clip curation for instructors within realistic hardware constraints.

Abstract

We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.
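
The abstract reports gains in MRR and nDCG@10. As a reading aid, the following is a minimal sketch of how these two metrics are commonly computed for a single query. It assumes linear gains and a flat set of relevant segment IDs; the paper's exact relevance judgments and gain definition are not stated in this section, and the function names `reciprocal_rank` and `ndcg_at_k` are illustrative, not taken from the authors' code.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant segment, or 0.0 if none is retrieved."""
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG@k with linear gains (an assumption; graded or exponential gains also exist).

    `gains` maps segment id -> non-negative relevance gain; missing ids count as 0.
    """
    dcg = sum(
        gains.get(seg_id, 0.0) / math.log2(pos + 1)
        for pos, seg_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["seg_b", "seg_a", "seg_c"]  # hypothetical segment ids
    print(reciprocal_rank(ranking, {"seg_a"}))
    print(ndcg_at_k(ranking, {"seg_a": 1.0}, k=10))
```

MRR is the mean of `reciprocal_rank` over all queries, and the reported nDCG@10 is likewise an average of per-query values.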

Paper Structure

This paper contains 30 sections, 4 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: End-to-end CrossFusion-RAG pipeline. Fusion is query-agnostic and computed offline to produce 768-d segment vectors; online we do bi-encoder retrieval, light reranking, diversification, and grounded generation.
  • Figure 2: nDCG@k with 95% confidence intervals (LOOCV).
  • Figure 3: nDCG@10 vs. median latency for closest comparators (dev split).
  • Figure 4: Median end-to-end latency decomposition on an A100 80 GB GPU. Generation dominates; retrieval, reranking, and diversification are comparatively small.
  • Figure 5: Per-course modality contributions: relative nDCG@10 by course for text-only, CLIP zero-shot (image-only), and fused (CrossFusion) variants.
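
The Figure 1 caption mentions a diversification step after retrieval and reranking. The sketch below illustrates one common way to diversify a ranked list, Maximal Marginal Relevance (MMR), over precomputed segment vectors. It is an assumption that MMR is the relevant technique here: this section names MMR only in connection with comparator baselines, and the helper names, the `lam` trade-off parameter, and the toy 2-d vectors are hypothetical (the paper's segment vectors are 768-d).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query_vec, segment_vecs, top_k=3, lam=0.7):
    """Greedy MMR: trade off query relevance against redundancy with already-picked segments.

    Returns the indices of the selected segments in selection order.
    lam=1.0 reduces to plain relevance ranking; smaller lam favors diversity.
    """
    relevance = [cosine(query_vec, v) for v in segment_vecs]
    remaining = list(range(len(segment_vecs)))
    selected = []
    while remaining and len(selected) < top_k:

        def mmr_score(i):
            # Redundancy is the highest similarity to any segment already selected.
            redundancy = max(
                (cosine(segment_vecs[i], segment_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    segments = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]  # segment 1 nearly duplicates segment 0
    print(mmr_select(query, segments, top_k=2, lam=0.3))
```

With a low `lam`, the near-duplicate of the top segment is skipped in favor of a less similar one, which is the behavior a diversification stage is meant to provide for overlapping lecture segments.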