ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries

Wangyu Xue, Chen Qian, Jiayi Wu, Yang Zhou, Wentao Liu, Ju Ren, Siming Fan, Yaoxue Zhang

TL;DR

This work defines BestShot, a precise frame-level highlight retrieval task for human-centric videos guided by natural language. It introduces the BestShot Benchmark and two large-scale training datasets (ShotGPT4o and Image-SMPLText) to address data diversity and fine-grained descriptions, and proposes ShotVL, a strong zero-shot baseline derived from InternVL. ShotVL achieves substantial gains over prior models on BestShot and THUMOS14 while maintaining general image-text retrieval competence, and is further extended through ShotVL-Chat-LLaVa and ShotVL-LITA to explore video-language interactions. The study highlights the potential of combining frame-focused pose and action descriptions with vision-language models, while acknowledging data quality and domain coverage limitations and outlining avenues for future integration with video LLMs and larger, more diverse datasets.

Abstract

Existing works on human-centric video understanding typically focus on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos via language queries. This task demands not only a deep semantic comprehension of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, detailed textual descriptions, and duration labeling. These descriptions encompass three critical elements: (1) visual content; (2) fine-grained action; and (3) human pose description. Together, these elements provide the precision needed to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) the ShotGPT4o Dataset, which is algorithmically generated by GPT-4o, and (ii) the Image-SMPLText Dataset, a large-scale dataset with accurate per-frame pose descriptions that leverages PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL specifically for BestShot. We highlight the zero-shot capabilities of our model and offer comparative analyses with existing SOTA models. ShotVL demonstrates a 52% improvement over InternVL on the BestShot Benchmark and a 57% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.
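To make the task concrete, the following is a minimal sketch of frame-level retrieval by language query. It assumes a dual-encoder setup in which every frame and the query are already embedded in a shared space. The abstract does not specify ShotVL's scoring function, so the cosine-similarity ranking and the names `retrieve_best_frame` and `in_any_interval` are illustrative assumptions. The embeddings are toy vectors, not model outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_best_frame(
    frame_embs: Sequence[Sequence[float]], query_emb: Sequence[float]
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring frame and all per-frame scores.

    Assumption: similarity between frame and query embeddings is a usable
    proxy for how well a frame matches the language query.
    """
    if not frame_embs:
        raise ValueError("frame_embs must contain at least one frame")
    scores = [cosine(f, query_emb) for f in frame_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def in_any_interval(idx: int, intervals: Sequence[Tuple[int, int]]) -> bool:
    """Check whether a frame index lies inside any annotated (start, end) interval.

    Hypothetical hit criterion: intervals are inclusive frame-index ranges.
    """
    return any(start <= idx <= end for start, end in intervals)


# Toy example: three frames with 2-D embeddings and one query embedding.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.0, 1.0]
best_idx, frame_scores = retrieve_best_frame(frames, query)
hit = in_any_interval(best_idx, [(2, 3)])
```

The interval check reflects that a query may correspond to more than one annotated interval, so a prediction would count as correct if it falls inside any of them. The benchmark's actual metric may differ.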

Paper Structure

This paper contains 19 sections, 22 figures, 12 tables.

Figures (22)

  • Figure 1: Example of BestShot Benchmark. The task is to locate the exact frame through language queries related to content, action stage and pose description. Each query may correspond to multiple intervals.
  • Figure 2: Zero-shot evaluation on BestShot, Temporal Action Localization (THUMOS14), Action Classification (AVA), and CLIP Classification and Retrieval Benchmark.
  • Figure 3: Annotation pipeline of ShotGPT4o.
  • Figure 4: (a) Example datasets with ground-truth SMPL annotations that we used. (b) Differences in pose descriptions among human-written, GPT-4o, and Image-SMPLText sources.
  • Figure 5: Training and inference pipeline of ShotVL. (a) Training. (b) Inference of BestShot and Moment Retrieval.
  • ...and 17 more figures