ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries

Wangyu Xue; Chen Qian; Jiayi Wu; Yang Zhou; Wentao Liu; Ju Ren; Siming Fan; Yaoxue Zhang

TL;DR

This work defines BestShot, a precise frame-level highlight retrieval task for human-centric videos guided by natural language. It introduces the BestShot Benchmark and two large-scale training datasets (ShotGPT4o and Image-SMPLText) to address data diversity and fine-grained descriptions, and proposes ShotVL, a strong zero-shot baseline derived from InternVL. ShotVL achieves substantial gains over prior models on BestShot and THUMOS14 while maintaining general image-text retrieval competence, and is further extended through ShotVL-Chat-LLaVa and ShotVL-LITA to explore video-language interactions. The study highlights the potential of combining frame-focused pose and action descriptions with vision-language models, while acknowledging data quality and domain coverage limitations and outlining avenues for future integration with video LLMs and larger, more diverse datasets.

Abstract

Existing works on human-centric video understanding typically focus on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos via language queries. This task demands not only a deep semantic comprehension of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, detailed textual descriptions, and duration labeling. These descriptions encompass three critical elements: (1) visual content; (2) fine-grained action; and (3) human pose description. Together, these elements provide the necessary precision to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) the ShotGPT4o Dataset, which is algorithmically generated by GPT-4o, and (ii) the Image-SMPLText Dataset, a large-scale dataset of accurate per-frame pose descriptions built by leveraging PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing SOTA models. ShotVL demonstrates a significant 52% improvement over InternVL on the BestShot Benchmark and a notable 57% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.
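
To make the task setup concrete, the following is a minimal sketch of frame-level retrieval by embedding similarity, which is one common way contrastive vision-language models are used for retrieval. It is an illustration under stated assumptions, not the ShotVL implementation: the abstract does not specify the scoring function, and the function names (`cosine`, `retrieve_highlight_frame`) and the toy 3-dimensional embeddings are hypothetical. In practice the embeddings would come from an image encoder applied to each frame and a text encoder applied to the query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_highlight_frame(frame_embeddings, query_embedding):
    """Score every frame against the query and return (best_index, scores).

    Hypothetical helper: assumes one embedding per video frame and a single
    query embedding in the same vector space.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine(frame, query_embedding) for frame in frame_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy stand-ins for encoder outputs (assumed values, for illustration only).
frames = [
    [1.0, 0.0, 0.0],  # frame 0: unrelated to the query
    [0.6, 0.8, 0.0],  # frame 1: partially matches the query
    [0.0, 1.0, 0.0],  # frame 2: matches the query direction exactly
]
query = [0.0, 1.0, 0.0]

best_index, scores = retrieve_highlight_frame(frames, query)
print(best_index, scores)
```

Because the task is defined at the frame level, the sketch returns a single frame index rather than a time span; a benchmark with duration labels could then count a prediction as correct when the index falls inside the annotated highlight interval.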
