ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Reza Ghoddoosian, Nakul Agarwal, Isht Dwivedi, Behzad Dariush

TL;DR

This paper proposes Action Concept Enhancement (ACE), a simple fine-tuning technique that improves the robustness and concept understanding of VLMs in procedural action classification. The authors demonstrate the enhanced concept understanding of their VLM by visualizing how the encoded embeddings of unseen action synonyms align in the embedding space.

Abstract

Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA, and GTEA datasets demonstrate the efficacy of ACE in the domains of cooking and assembly, leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.
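
The stochastic label replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synonym lists, the `replace_prob` parameter, and the function name `sample_label_set` are all hypothetical, and the sketch covers only the label-sampling step (it omits the augmented negatives and the auxiliary classification loss itself). It shows how redrawing the text label for each class at every training step yields new label combinations over the course of fine-tuning.

```python
import random

# Hypothetical synonym table for illustration only; the paper's actual
# synonym trees are not reproduced here.
SYNONYMS = {
    "fasten": ["tighten", "secure", "attach"],
    "insert": ["put in", "place inside", "slot in"],
}


def sample_label_set(class_names, synonyms, replace_prob, rng):
    """Return one text label per class.

    Each fixed label is swapped for a randomly chosen synonym with
    probability `replace_prob` (an assumed hyperparameter); classes
    without known synonyms always keep their fixed label.
    """
    sampled = []
    for name in class_names:
        options = synonyms.get(name, [])
        if options and rng.random() < replace_prob:
            sampled.append(rng.choice(options))
        else:
            sampled.append(name)
    return sampled


# Each training step would draw a fresh label set before encoding the text.
rng = random.Random(0)
classes = ["fasten", "insert"]
for step in range(3):
    labels = sample_label_set(classes, SYNONYMS, replace_prob=0.5, rng=rng)
    print(step, labels)
```

In a full training loop, the sampled strings would be passed to the text encoder in place of the fixed class names, so that the classifier is not tied to a single wording per action concept.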

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of the similarity between video and text representations for three action classes (concepts). Thicker lines indicate greater similarity. Baseline VLMs (left) struggle with robustness to action synonyms. In contrast, ACE (right) improves accuracy in matching videos to action concepts, regardless of the synonym used.
  • Figure 2: Synonym trees for the action verbs 'fasten' and 'insert' and sample notations. Each tree represents an action concept, with replicated parent nodes highlighted in bold. Some second-order synonyms provide broader descriptions of the action.
  • Figure 3: Impact of the quantity of augmented synonyms on mean and std (shaded area) for novel and base actions of the ATA dataset.
  • Figure 4: Impact of fine-tuning various layers of video-text encoders on the mean F1 score. Results on ATA and split 1 of IKEA.
  • Figure 5: t-SNE visualization of synonym embeddings in the seen (outlined in green) and unseen (outlined in red) action spaces of the IKEA dataset. After ACE is applied, the original CLIP embeddings of action synonyms are grouped more distinctly. Please zoom in to see finer details.