Storyboard guided Alignment for Fine-grained Video Action Recognition

Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, Liu Liu

TL;DR

Inspired by the concept of storyboarding, which disassembles a script into individual shots, the method enhances global video semantics with fine-grained descriptions that are generated by a pre-trained large language model and capture common atomic actions depicted in videos.

Abstract

Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs because action semantics are not understood at the granularity of atomic actions. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even not directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions depicted in videos. A filtering metric is proposed to select the descriptions that correspond to the atomic actions present in both the videos and the descriptions. By employing global semantics and fine-grained descriptions, we can identify key frames in videos and use them to aggregate embeddings, thereby making the video embedding more accurate. Extensive experiments on various video action recognition datasets demonstrate superior performance of our proposed method in supervised, few-shot, and zero-shot settings.

Paper Structure

This paper contains 13 sections, 11 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: An overview of our framework for video action recognition. We extend CLIP to classify a video $\{\mathbf{I}_{\text{l}}\}_{\text{l}=1}^{L}$ of $L$ frames by computing a video embedding from the frame embeddings $\{\mathbf{v}_{\text{l}}^\text{cls}\}_{\text{l}=1}^{L}$, which are extracted with the visual encoder of CLIP, in three steps. i) We decompose the global text prompt $\mathbf{T}_{\text{c}}$, which describes the class semantics of action $\text{c}$, into descriptions of atomic actions (i.e., sub-text prompts) using a pre-trained large language model. The global text prompt $\mathbf{T}_{\text{c}}$ and the sub-text prompts are then embedded by the textual encoder of CLIP to obtain the embeddings $\mathcal{T}_{\text{c}}$ and $\{\mathcal{S}_{\text{c}, \text{n}}\}_{\text{n}=1}^{N}$, where $N$ is the number of sub-text prompts. ii) A coarse video embedding is extracted by augmenting the global text embedding $\mathcal{T}_{\text{c}}$ with the sub-text embeddings $\{\mathcal{S}_{\text{c}, \text{n}}\}_{\text{n}=1}^{N}$, calculating the coarse importance $\{a_{\text{l}}\}_{\text{l}=1}^{L}$ of the frame embeddings $\{\mathbf{v}_{\text{l}}^\text{cls}\}_{\text{l}=1}^{L}$ with respect to the augmented global text embedding $\hat{\mathcal{T}}_{\text{c}}$, and using this importance to aggregate the frame embeddings into a coarse video embedding $\mathbf{o}^\text{coarse}$. iii) Similarly, we obtain a fine-grained video embedding $\mathbf{o}^\text{fine}$ by calculating the fine-grained importance $\{a_{\text{l}}^\text{fine}\}_{\text{l}=1}^{L}$ of the frame embeddings with respect to the sub-text embeddings $\{\mathcal{S}_{\text{c}, \text{n}}\}_{\text{n}=1}^{N}$ and using it to aggregate the frame embeddings. The coarse video embedding $\mathbf{o}^\text{coarse}$ and the fine-grained video embedding $\mathbf{o}^\text{fine}$ are fused into the video embedding $\mathbf{o}$ for action recognition.
  • Figure 2: Analysis of sub-texts. We study (a) the impact of the number of sub-texts on action recognition performance, and (b) the correlation between TPP (the average of $\text{TPP}_{\text{c}}$ over all actions) and action recognition performance; the fitted dashed line has an $r^2$ value of 0.79.
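
The three steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not give the exact formulas, so the augmentation rule (adding the mean sub-text embedding to the global text embedding), the softmax over cosine similarities with a `temperature` parameter, the averaging of per-sub-text weights, and the fusion by simple averaging are all assumptions made for illustration. The function name `aggregate_video_embedding` and its helpers are hypothetical, and the embeddings are plain lists of floats standing in for CLIP outputs.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def aggregate_video_embedding(frames, global_text, sub_texts, temperature=0.1):
    """Fuse coarse and fine-grained video embeddings (illustrative only).

    frames:      L frame embeddings (stand-ins for v_l^cls)
    global_text: one embedding (stand-in for T_c)
    sub_texts:   N sub-text embeddings (stand-ins for S_{c,n})
    """
    frames = [normalize(f) for f in frames]
    sub_texts = [normalize(s) for s in sub_texts]
    uniform = [1.0 / len(sub_texts)] * len(sub_texts)

    # Step ii (assumed form): augment the global text embedding with the
    # mean sub-text embedding, then weight frames by similarity to it.
    mean_sub = weighted_sum(uniform, sub_texts)
    augmented = normalize(
        [g + s for g, s in zip(normalize(global_text), mean_sub)]
    )
    coarse_w = softmax([dot(f, augmented) for f in frames], temperature)
    o_coarse = weighted_sum(coarse_w, frames)

    # Step iii (assumed form): one softmax over frames per sub-text,
    # averaged over sub-texts to get a single weight per frame.
    per_sub = [
        softmax([dot(f, s) for f in frames], temperature) for s in sub_texts
    ]
    fine_w = [
        sum(w[l] for w in per_sub) / len(per_sub) for l in range(len(frames))
    ]
    o_fine = weighted_sum(fine_w, frames)

    # Fusion (assumed form): average the two embeddings and renormalize.
    fused = normalize([(c + f) / 2 for c, f in zip(o_coarse, o_fine)])
    return fused, coarse_w, fine_w
```

Calling `aggregate_video_embedding(frames, global_text, sub_texts)` returns the fused embedding together with the coarse and fine-grained frame weights. Each set of weights sums to one, so frames that match the text embeddings more closely contribute more to the aggregated video embedding.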