Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding

Kaiting Liu; Hazel Doughty

TL;DR

The paper tackles the rigidity of fixed taxonomies in video understanding by introducing category splitting, a task in which a trained classifier is edited to refine a coarse label into fine-grained subcategories while preserving its existing predictions. The proposed zero-shot method exploits latent compositional structure in video classifiers, using modifier retrieval and modifier alignment to edit only the classification head; low-shot fine-tuning provides additional gains. On the proposed SSv2-Split and FineGym-Split benchmarks, zero-shot and low-shot edits substantially outperform vision-language baselines in generality while maintaining near-perfect locality. The work highlights the latent compositionality in video backbones and offers a practical, data-efficient path to adapting taxonomy granularity in specialized domains.

Abstract

Video recognition models are typically trained on fixed taxonomies, which are often too coarse and collapse distinctions in object, manner, or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.
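
To make the head edit concrete, the following is a minimal NumPy sketch of the operation described in the Figure 2 caption: a modifier vector is added to the coarse category's weight vector to create each new subcategory row, and all other rows are left untouched. It is an illustration under stated assumptions, not the authors' implementation. The function name `split_category`, the linear head with a bias term, the choice to copy the coarse bias to every new row, and the random `modifiers` array (standing in for vectors retrieved from the modifier dictionary) are all hypothetical.

```python
import numpy as np

def split_category(W, b, coarse_idx, modifiers):
    """Replace one coarse row of a linear head with one row per subcategory.

    W: (C, D) head weights, b: (C,) biases.
    modifiers: (K, D) modifier vectors, one per new subcategory
               (assumed to be already retrieved; how they are mined is not shown).
    Returns weights of shape (C - 1 + K, D) and biases of shape (C - 1 + K,).
    The K new subcategory rows are appended after the untouched rows.
    """
    keep = np.arange(W.shape[0]) != coarse_idx
    # New subcategory weight = coarse weight + modifier vector (broadcast over K).
    new_rows = W[coarse_idx] + modifiers
    # Assumption: each subcategory inherits the coarse category's bias.
    new_bias = np.full(len(modifiers), b[coarse_idx])
    W_new = np.concatenate([W[keep], new_rows], axis=0)
    b_new = np.concatenate([b[keep], new_bias])
    return W_new, b_new

# Toy example: 5 categories, 8-dim features, split category 2 into 3 subcategories.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
b = np.zeros(5)
modifiers = 0.1 * rng.normal(size=(3, 8))  # placeholder modifier vectors
W_new, b_new = split_category(W, b, coarse_idx=2, modifiers=modifiers)
print(W_new.shape)  # (7, 8)
```

Because only the coarse row is replaced, the logits of every other category are identical before and after the edit for any input feature; predictions elsewhere can still change if a new subcategory row outscores them.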

Paper Structure (27 sections, 13 equations, 7 figures, 13 tables)

This paper contains 27 sections, 13 equations, 7 figures, 13 tables.

Introduction
Category Splitting Problem Definition
Zero-Shot Category Splitting
Zero-Shot Editing: Modifier Retrieval
Zero-Shot Editing: Modifier Alignment
Low-shot Category Splitting
Category Splitting Datasets
Experiments & Results
Implementation Details
Evaluation Metrics
Comparative Results
Ablation Study
Analysis
Qualitative Results
Related Work
...and 12 more sections

Figures (7)

Figure 1: Category splitting aims to edit a trained video classifier by dividing a coarse label into multiple fine-grained subcategories, while keeping all other predictions unchanged. The challenge is to achieve this without retraining the full model and with zero or very few labels.
Figure 2: Zero-shot Category Splitting. Given a trained video classifier, our goal is to split a coarse category (e.g. pushing) into fine-grained subcategories without any video data. Modifier retrieval first exposes compositional structure in the video classifier's classification head from which it builds a dictionary of modifier vectors. The classifier is then edited by retrieving the appropriate modifier vector and adding it to the coarse category's weight vector to create a new fine-grained subcategory. To generalize to unseen modifiers, modifier alignment learns a lightweight mapping from modifier text to modifier vectors, using category text/weight vectors as additional supervision.
Figure 3: Low-shot Category Splitting. We edit the model by replacing the coarse category $c$ with $\theta'_{head}$. This head is fine-tuned with as few as one video per fine-grained subcategory, initialized with our zero-shot approach.
Figure 4: Zero-shot Ablation. Our modifier retrieval and alignment greatly improve generality by mining modifier vectors from the video-only classifier.
Figure 5: Analysis over different category splits. (a) Locality decreases slightly with more subcategories in the split, while generality shows no trend. (b) Performance is highest for direction-based splits and lowest for differences in object count, intent/success, and object interactions. (c) Existing analogous categories with the same modifier help, but our approach remains effective without them.
...and 2 more figures
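
The modifier alignment step in the Figure 2 caption learns a lightweight mapping from modifier text to modifier vectors so that unseen modifiers can be handled. One simple way to picture such a mapping is a linear map fitted by ridge regression, sketched below. This is an assumption for illustration only: the source does not state the form of the mapping, and the function name `fit_alignment`, the regularization strength `lam`, and the synthetic embeddings are hypothetical.

```python
import numpy as np

def fit_alignment(T, M, lam=1e-2):
    """Fit a linear map A minimizing ||T @ A - M||^2 + lam * ||A||^2.

    T: (N, Dt) text embeddings (modifier and/or category text).
    M: (N, Dw) target vectors in classifier weight space.
    Returns A with shape (Dt, Dw).
    """
    Dt = T.shape[1]
    # Closed-form ridge solution: (T^T T + lam I)^{-1} T^T M.
    return np.linalg.solve(T.T @ T + lam * np.eye(Dt), T.T @ M)

# Synthetic check: recover a known linear map from noiseless pairs.
rng = np.random.default_rng(0)
A_true = rng.normal(size=(6, 4))
T_train = rng.normal(size=(50, 6))   # stand-in text embeddings
M_train = T_train @ A_true           # stand-in modifier/category vectors
A = fit_alignment(T_train, M_train, lam=1e-6)

# Predict a modifier vector for an unseen modifier's text embedding.
t_unseen = rng.normal(size=(1, 6))
m_pred = t_unseen @ A
print(m_pred.shape)  # (1, 4)
```

In this picture, the predicted vector for an unseen modifier would be added to the coarse category's weight vector in the same way as a retrieved one.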
