Multimodal Language Models for Domain-Specific Procedural Video Summarization
Nafisa Hussain
TL;DR
This work investigates domain-specific procedural video summarization using multimodal large language models. Cooking (Tasty) and medical (MedVidQA) data are converted into instruction-following formats with GPT-4, and TimeChat is then fine-tuned on the result, with the aim of generating concise stepwise descriptions and grounding events with precise timestamps in long tutorials. Results indicate that domain-focused fine-tuning improves instruction extraction and step localization, with measurable gains on YouCook2, MedVidQA, and Tasty-derived tasks, though the size of the improvement depends on the domain. The study demonstrates the practical potential of specialized multimodal models to provide targeted, step-by-step guidance across domains, paving the way for time-aware video assistants in education and training.
Abstract
Videos serve as a powerful medium to convey ideas, tell stories, and provide detailed instructions, especially through long-format tutorials. Such tutorials are valuable for learning new skills at one's own pace, yet their length and dense content can be overwhelming. Viewers often seek specific information, such as precise measurements or step-by-step execution details, so extracting and summarizing key segments efficiently is essential. An intelligent, time-sensitive video assistant capable of summarizing long videos and detecting their highlights is highly sought after, and recent advances in Multimodal Large Language Models offer a promising way to build one. Our research explores the use of multimodal models to enhance video summarization and step-by-step instruction generation within specific domains. These models need to understand temporal events and the relationships among actions across video frames. Our approach fine-tunes TimeChat to improve its performance in two domains: cooking and medical procedures. By training the model on domain-specific datasets, Tasty for cooking and MedVidQA for medical procedures, we aim to enhance its ability to generate concise, accurate summaries of instructional videos. We curate and restructure these datasets to create high-quality video-centric instruction data. Our findings indicate that fine-tuning TimeChat on domain-specific procedural data significantly improves its extraction and summarization of key instructional steps in long-format videos. This research demonstrates the potential of specialized multimodal models to assist with practical tasks by providing personalized, step-by-step guidance tailored to the unique aspects of each domain.
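As a rough illustration of what restructuring timestamped step annotations into video-centric instruction data could look like, the Python sketch below turns a list of annotated steps into a single instruction-response record. The `Step` class, the `to_instruction_sample` helper, the field names (`video`, `instruction`, `response`), and the timestamp phrasing are all assumptions made for illustration; they are not taken from TimeChat, the Tasty or MedVidQA releases, or the GPT-4 conversion used in this work.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    """One annotated procedural step (hypothetical schema)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # short description of the step


def to_instruction_sample(video_id, steps, instruction):
    """Build one instruction-following record from timestamped steps.

    Steps are ordered by start time, and each one is rendered as
    "<start> - <end> seconds, <description>" so the target response
    carries both the summary and its temporal grounding.
    """
    ordered = sorted(steps, key=lambda s: s.start)
    lines = [
        f"{s.start:.1f} - {s.end:.1f} seconds, {s.text.strip()}"
        for s in ordered
    ]
    return {
        "video": video_id,
        "instruction": instruction,
        "response": " ".join(lines),
    }


# Usage example with made-up annotations (not real dataset entries).
sample = to_instruction_sample(
    "demo_video",
    [
        Step(12.0, 20.5, "Chop the onions."),
        Step(0.0, 11.0, "Preheat the oven. "),
    ],
    "List each step shown in the video with its start and end time.",
)
print(json.dumps(sample, indent=2))
```

Placing the timestamps inside the response text is one simple way to let a language-model objective supervise step localization and step description together; other encodings, such as separate timestamp fields, would be equally plausible.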
