Continual Learning Improves Zero-Shot Action Recognition

Shreyank N Gowda, Davide Moltisanti, Laura Sevilla-Lara

TL;DR

A novel method based on continual learning to address zero-shot action recognition that uses a memory of synthesized features of past classes, and combines these synthetic features with real ones from novel classes to improve generalization in unseen classes.

Abstract

Zero-shot action recognition requires a strong ability to generalize from pre-training and seen classes to novel unseen classes. Similarly, continual learning aims to develop models that can generalize effectively and learn new tasks without forgetting the ones previously learned. The generalization goals of zero-shot and continual learning are closely aligned; however, techniques from continual learning have not been applied to zero-shot action recognition. In this paper, we propose a novel method based on continual learning to address zero-shot action recognition. This model, which we call Generative Iterative Learning (GIL), uses a memory of synthesized features of past classes, and combines these synthetic features with real ones from novel classes. The memory is used to train a classification model, ensuring a balanced exposure to both old and new classes. Experiments demonstrate that GIL improves generalization in unseen classes, achieving a new state of the art in zero-shot recognition across multiple benchmarks. Importantly, GIL also boosts performance in the more challenging generalized zero-shot setting, where models need to retain knowledge about classes seen before fine-tuning.

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: GIL at a glance. The main idea is to use a Replay Memory from continual learning to fine-tune a video model $\mathcal{M}$. The Replay Memory consists of a buffer which contains classes from the pre-training and fine-tuning datasets. Samples in the buffer are generated with a semantic-to-visual encoder $E$ and a feature generator $\mathcal{F}$.
  • Figure 2: Overview of GIL. The initialization stage involves booting the replay memory, i.e., storing pre-training class prototypes and noise obtained by averaging video features encoded by $\mathcal{M}$. $E$ is then trained to produce prototypes given a semantic embedding, and $\mathcal{F}$ is trained to generate visual features from the output of $E$. In the incremental learning stage we fine-tune $\mathcal{M}$ with a mix of synthetic and real features. Synthetic features are generated from the memory buffer, while real features are obtained with the backbone of $\mathcal{M}$. Real features are added gradually by sampling a subset of new classes at a time. In the update stage we add prototypes and noise for the new classes to the buffer and fine-tune $E$ with this new data.
  • Figure 3: Testing pipeline. (1) We project unseen classes by fine-tuning $\mathcal{M}$ with synthetic features. This is done by first feeding $E$ a class semantic embedding. $E$ then outputs a class prototype and noise, which are fed to $\mathcal{F}$ to generate the visual features we use to fine-tune $\mathcal{M}$. (2) After fine-tuning, given a test video instance we perform a nearest neighbor (NN) search to predict class $\hat{y}$.
  • Figure 4: Comparing real (left) vs synthetic (right) features using t-SNE projections. We show the embeddings obtained from 10 random classes in HMDB-51. Synthetic features appear more compact than real features, which is beneficial for training the video model. We believe this is the reason why using exclusively synthetic features for the pre-training classes works better than using real features or a mix of the two.
  • Figure 5: Test accuracy versus training completion percentage.
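
The captions for Figures 2 and 3 describe two mechanics that a small example may make more concrete: mixing synthetic features drawn from the replay memory with real features of new classes, and predicting a class with a nearest neighbor search. The following is a minimal NumPy sketch, not the authors' implementation. All function names are hypothetical, the feature generator $\mathcal{F}$ is replaced by a simple "prototype plus Gaussian noise" stand-in, and cosine similarity is assumed for the NN search, since the captions do not specify the distance used.

```python
import numpy as np

def synthesize_features(prototype, noise_scale, n_samples, rng):
    # Stand-in for the feature generator F (assumption): the class
    # prototype perturbed by isotropic Gaussian noise.
    noise = rng.standard_normal((n_samples, prototype.shape[0]))
    return prototype + noise_scale * noise

def build_mixed_batch(memory, real_feats, real_labels, n_per_class, rng):
    # memory maps class label -> (prototype, noise_scale), a hypothetical
    # layout for the replay buffer. Old classes contribute synthetic
    # features only; new classes contribute real features.
    feats, labels = [], []
    for label, (prototype, noise_scale) in memory.items():
        feats.append(synthesize_features(prototype, noise_scale, n_per_class, rng))
        labels.append(np.full(n_per_class, label))
    feats.append(real_feats)
    labels.append(real_labels)
    return np.concatenate(feats), np.concatenate(labels)

def nearest_neighbor_predict(query, ref_feats, ref_labels):
    # Cosine-similarity nearest neighbor (assumed metric): return the
    # label of the reference feature most similar to the query.
    q = query / np.linalg.norm(query)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return ref_labels[np.argmax(r @ q)]

rng = np.random.default_rng(0)
# Two "old" classes stored in memory, one "new" class with real features.
memory = {
    0: (np.array([1.0, 0.0, 0.0]), 0.01),
    1: (np.array([0.0, 1.0, 0.0]), 0.01),
}
real_feats = np.array([[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
real_labels = np.array([2, 2])

feats, labels = build_mixed_batch(memory, real_feats, real_labels, 4, rng)
pred = nearest_neighbor_predict(np.array([0.1, 0.0, 0.95]), feats, labels)
```

In this toy setup the query lies close to the real features of class 2, so the search returns that label. In the pipeline described above, the synthetic features would instead come from $E$ and $\mathcal{F}$, and the real features from the backbone of $\mathcal{M}$.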