Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation

Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani

TL;DR

The paper introduces a training-free pipeline with a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency.

Abstract

Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we investigate the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
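
To make the frame-to-label matching step concrete, here is a minimal sketch of how a frame-action similarity matrix could be computed. It assumes that frame and label embeddings have already been produced by some VLM's image and text encoders; the function names `frame_action_similarity` and `classify_frames`, the use of cosine similarity, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frame_action_similarity(frame_emb, text_emb):
    """Cosine similarity between every frame and every candidate action label.

    frame_emb: (num_frames, dim) array, assumed to come from a VLM image encoder.
    text_emb:  (num_actions, dim) array, assumed to come from the VLM text encoder.
    Returns a (num_frames, num_actions) similarity matrix.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T

def classify_frames(sim):
    """Per-frame classification: pick the most similar label for each frame."""
    return sim.argmax(axis=1)

# Toy data (assumption): 3 orthogonal label embeddings and 9 noisy frames.
rng = np.random.default_rng(0)
text_emb = np.eye(3, 16)
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
frame_emb = text_emb[true_labels] + 0.05 * rng.normal(size=(9, 16))

sim = frame_action_similarity(frame_emb, text_emb)
pred = classify_frames(sim)
```

Per-frame argmax of this kind ignores temporal structure, which is why the pipeline described above follows it with a temporal-consistency stage.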

Paper Structure (30 sections, 9 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Problem Setup: (a) Existing approaches use a fixed vocabulary and do not generalize to unseen videos. (b) Our proposed method for open-vocabulary, zero-shot action segmentation.
  • Figure 2: Open-Vocabulary Temporal Action Segmentation (OVTAS) Pipeline. Our 2-stage pipeline adopts a "segmentation by classification" approach to tackle temporal action segmentation (TAS). Stage 1, Frame–Action Embedding Similarity (FAES), generates a similarity matrix by matching frames with action labels. Stage 2, Similarity-Matrix driven Temporal Segmentation (SMTS), uses optimal transport with a temporal prior to enforce temporal consistency, producing stable action segments.
  • Figure 3: Qualitative results: columns show segmentation results of our method on GTEA, 50 Salads, and Breakfast (left to right), with two examples each.
  • Figure 4: Performance vs Model Size. Models are grouped by family (SigLIP, CLIP, OpenCLIP, PECore). Shaded regions indicate parameter size bins: Low ($\leq$400M), Large (400--800M), Huge (800--1500M), and Giant ($>$1500M).
  • Figure 5: Per-family average of the Avg metric across datasets (GTEA, 50 Salads, Breakfast). Each line is a VLM family; per-dataset boxes list all values.
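
The second stage in the Figure 2 caption, optimal transport with a temporal prior, can be illustrated with a simplified stand-in. The sketch below uses plain entropy-regularised optimal transport (Sinkhorn iterations) with a hypothetical diagonal prior. It assumes uniform marginals over frames and actions and that the labels are listed in their order of occurrence; these are simplifications for illustration and may not match the paper's formulation. All names and parameter values (`sinkhorn`, `temporal_prior`, `segment`, `prior_weight`, `eps`) are hypothetical.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport; returns the transport plan."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)
        v = col_marginal / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def temporal_prior(num_frames, num_actions):
    """Hypothetical prior: cost grows with distance from the time/order diagonal."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    k = np.linspace(0.0, 1.0, num_actions)[None, :]
    return np.abs(t - k)

def segment(sim, prior_weight=0.5, eps=0.1):
    """Assign each frame a label from a frame-action similarity matrix."""
    num_frames, num_actions = sim.shape
    cost = (1.0 - sim) + prior_weight * temporal_prior(num_frames, num_actions)
    rows = np.full(num_frames, 1.0 / num_frames)    # assumed uniform frame mass
    cols = np.full(num_actions, 1.0 / num_actions)  # assumed uniform action mass
    plan = sinkhorn(cost, rows, cols, eps=eps)
    return plan.argmax(axis=1)

# Toy similarity matrix (assumption): 3 actions, 9 frames, one noisy frame.
sim = np.full((9, 3), 0.1)
sim[0:3, 0] = 0.9
sim[3:6, 1] = 0.9
sim[6:9, 2] = 0.9
sim[4] = [0.1, 0.5, 0.6]  # frame 4 is slightly closer to the wrong label

raw = sim.argmax(axis=1)  # per-frame argmax, no temporal reasoning
smoothed = segment(sim)   # transport plan with the temporal prior
```

In this toy case the per-frame argmax mislabels the noisy middle frame, while the transport plan with the prior keeps it in the surrounding segment. A diagonal prior only makes sense when the label order is known, so treat it as one possible choice rather than a requirement of the method.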