Video Action Differencing

James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

TL;DR

This work introduces Video Action Differencing (VidDiff), a zero-shot task of identifying fine-grained differences between two videos of the same action, and benchmarks it with VidDiffBench, a dataset of 549 video pairs, 4,469 annotated differences, and 2,075 localization timestamps across diverse domains. The authors propose the VidDiff method, a three-stage agentic workflow that uses an LLM for difference proposals, CLIP-based localization, and vision-language models for frame-level differencing, and report improved open- and closed-set performance over strong baselines. Their analysis identifies two core challenges, precise sub-action localization and subtle frame-to-frame differences, and provides extensive ablations and error analyses to guide future improvements. The work, including dataset and code releases, offers a foundation for research in fine-grained, language-grounded video analysis, with applications in coaching, sports analytics, medicine, and education.

Abstract

How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
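The three stages named in the abstract (difference proposal, keyframe localization, frame differencing) can be sketched as a small pipeline. This is only an illustrative skeleton, not the authors' implementation: the names `Difference`, `run_pipeline`, and the `toy_*` functions are hypothetical, and frames are reduced to single numbers so the example runs without any model. In a real system the three callables would wrap an LLM, a localizer, and a vision-language model respectively.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Frame = float  # toy stand-in: one number per frame; real frames would be images


@dataclass
class Difference:
    key: str          # short identifier, e.g. "depth" (hypothetical naming)
    description: str  # natural-language statement of the candidate difference


def run_pipeline(
    action: str,
    video_a: Sequence[Frame],
    video_b: Sequence[Frame],
    propose: Callable[[str], List[Difference]],
    localize: Callable[[Sequence[Frame], Difference], int],
    compare: Callable[[Frame, Frame, Difference], str],
) -> Dict[str, str]:
    """Run the three stages; returns "a", "b", or "neither" per difference."""
    predictions: Dict[str, str] = {}
    for diff in propose(action):                    # stage 1: difference proposal
        frame_a = video_a[localize(video_a, diff)]  # stage 2: keyframe localization
        frame_b = video_b[localize(video_b, diff)]
        predictions[diff.key] = compare(frame_a, frame_b, diff)  # stage 3: differencing
    return predictions


# Toy stand-ins for the three models (assumptions for illustration only).
def toy_propose(action: str) -> List[Difference]:
    return [Difference("depth", f"the {action} is deeper")]


def toy_localize(video: Sequence[Frame], diff: Difference) -> int:
    # Pretend the relevant keyframe is the one with the largest value.
    return max(range(len(video)), key=lambda i: video[i])


def toy_compare(frame_a: Frame, frame_b: Frame, diff: Difference) -> str:
    if frame_a > frame_b:
        return "a"
    if frame_b > frame_a:
        return "b"
    return "neither"


result = run_pipeline(
    "weighted squat",
    [0.1, 0.9, 0.2],
    [0.1, 0.5, 0.2],
    toy_propose,
    toy_localize,
    toy_compare,
)
print(result)
```

Passing the stages in as callables keeps each one independently replaceable, which mirrors the abstract's point that each stage uses its own specialized foundation model.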

Paper Structure

This paper contains 68 sections, 4 figures, and 14 tables.

Figures (4)

  • Figure 1: The Video Action Differencing task and benchmark (VidDiffBench). Given a pair of videos and an action, the task is to generate a list of differences as natural language descriptions. Our VidDiffBench consists of annotated differences across diverse domains, where the differences are relevant to human skill learning. The first row emphasizes the first key challenge: localization of sub-actions between segments of the video for comparison. The second row highlights the second key challenge: fine-grained image understanding of actions in order to perform comparison.
  • Figure 2: VidDiff Method. One input is an action description (e.g. "weighted squat"). The Difference Proposer generates potential differences using a large language model (LLM). The Frame Localizer assigns frames where these differences are observable. Finally, the Action Differencer checks each difference using a vision-language model, determining whether it applies more to video A or video B, or neither.
  • Figure 3: Examples of 'success cases' (left) -- differences where GPT-4o has high accuracy -- and failure cases (right). Success cases typically involve coarse differences, easy localization, or simple actions, while failure cases often involve fine differences, precise localization or complex actions.
  • Figure 4: Sample frame localizations: prediction vs ground truth.
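The Frame Localizer in Figure 2 and the prediction-versus-ground-truth comparison in Figure 4 can be illustrated with a simple similarity-based retrieval. The sketch below assumes each frame and each sub-action description has already been embedded into a shared vector space (for example by an image-text model such as CLIP); the embeddings here are made-up toy vectors, and `localize_keyframe` and `localization_error` are hypothetical helper names. The paper's actual localizer may be more involved than picking the single best-matching frame.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localize_keyframe(
    frame_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
) -> int:
    """Return the index of the frame most similar to the text embedding."""
    if not frame_embeddings:
        raise ValueError("video has no frames")
    scores: List[float] = [cosine(f, text_embedding) for f in frame_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def localization_error(predicted: int, ground_truth: int) -> int:
    """Absolute distance, in frames, between prediction and annotation."""
    return abs(predicted - ground_truth)


# Toy embeddings (assumed values): three frames and one sub-action description.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
text = [0.0, 1.0]

pred = localize_keyframe(frames, text)
print(pred, localization_error(pred, ground_truth=1))
```

Measuring error as a frame distance is one simple way to compare a predicted keyframe against an annotated timestamp; other tolerances or time-based units are equally plausible.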