C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, Josef Kittler

TL;DR

This work defines Zero-Shot Compositional Action Recognition (ZS-CAR) for video, in which unseen actions are composed of verbs and objects that were each seen during training. To evaluate this capability, it introduces the Something-composition (Sth-com) benchmark, derived from Something-Something V2. It also proposes the Component-to-Composition (C2C) learning framework, which first learns independent verb and object components and then infers actions via two composition pathways; an enhanced training strategy addresses component domain variations and component compatibility variations. The key contributions are a formal ZS-CAR problem formulation, the Sth-com benchmark with comprehensive statistics and a data-splitting strategy, and the C2C baseline, whose training regime uses HSIC-based independence, conditional constraints, and CutMix-based imagination of unseen compositions to balance seen and unseen performance. Experimental results demonstrate state-of-the-art performance on Sth-com across various backbones and show strong improvements over adapted CZSL and CLIP-based methods, underscoring the method’s effectiveness for video-based compositional generalization and its potential impact on open-set action understanding and robotics.

Abstract

Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, thus requiring so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. To evaluate the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Finally, we devise an enhanced training strategy to address the challenges of component variations between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state of the art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.
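
The TL;DR mentions an HSIC-based independence term used during training. As a rough illustration of what such a term measures, the sketch below computes the standard biased empirical Hilbert-Schmidt Independence Criterion, trace(K H L H) / (n - 1)^2, with Gaussian kernels. It is not taken from the C2C code: the function names, the kernel bandwidth `sigma`, and the one-dimensional toy "features" are all assumptions made for this example, and how the paper actually applies HSIC to verb and object features is not specified here.

```python
import math
import random


def gaussian_gram(samples, sigma=1.0):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [[math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2)) for b in samples]
            for a in samples]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    col_mean = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, sigma))
    l_gram = gaussian_gram(ys, sigma)
    # trace(K H L H) equals trace((H K H) L) by the cyclic property of the trace.
    trace = sum(k_centered[i][j] * l_gram[j][i]
                for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy data (hypothetical): 1-D stand-ins for verb and object features.
random.seed(0)
verb_feat = [[random.gauss(0, 1)] for _ in range(60)]
entangled_obj_feat = [[v[0] + 0.1 * random.gauss(0, 1)] for v in verb_feat]
independent_obj_feat = [[random.gauss(0, 1)] for _ in range(60)]

dependent = hsic(verb_feat, entangled_obj_feat)
independent = hsic(verb_feat, independent_obj_feat)
print(f"entangled: {dependent:.4f}, independent: {independent:.4f}")
```

In this toy setting the entangled pair should score noticeably higher than the independent pair, which is why minimizing an HSIC term is one plausible way to push component-specific features toward independence.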

Paper Structure (15 sections, 10 equations, 7 figures, 8 tables)

This paper contains 15 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Zero-Shot Compositional Action Recognition (ZS-CAR) requires models to recognize unseen actions composed of verbs and objects observed in seen actions.
  • Figure 2: The proposed Component-to-Composition (C2C) learning framework. C2C first aligns verb/object prototypes with corresponding visual features to obtain component scores in the Independent Component Learning module. Then, actions are inferred through two paths (dynamics and static) in the Component to Composition module. In the dynamics path, verb prototypes and visual features are used to compute conditional object scores. The independent verb scores and conditional object scores are then multiplied to obtain the action scores. The static path follows a similar procedure. The final output is a consensus of the results from both paths.
  • Figure 3: Component domain variations. For unseen actions, an object may exhibit different appearances (left). To deal with this (right), we reduce spurious information and increase the independence between component-specific features.
  • Figure 4: Component compatibility variations. The component relations are different between the training and test sets (left). To solve this (right), we use the observed conditional score to fit seen relations and encourage the model to imagine unseen actions to avoid being limited to seen relations.
  • Figure 5: Seen-unseen curves, drawn by gradually shifting the model's bias from unseen to seen actions.
  • ...and 2 more figures
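
The two-path inference described in the Figure 2 caption can be sketched with toy numbers. The snippet below only illustrates the scoring arithmetic: the probability tables are made up, the verb and object names are hypothetical, and the caption does not say how the "consensus" of the two paths is formed, so a plain average is assumed here. The symmetric form of the static path (independent object scores times conditional verb scores) is likewise an assumption based on the caption's "similar procedure".

```python
def compose_action_scores(p_verb, p_obj, p_obj_given_verb, p_verb_given_obj):
    """Combine component scores into a verb-by-object table of action scores.

    p_verb[v]              : independent verb score
    p_obj[o]               : independent object score
    p_obj_given_verb[v][o] : conditional object score used by the dynamics path
    p_verb_given_obj[o][v] : conditional verb score used by the static path
    """
    scores = []
    for v in range(len(p_verb)):
        row = []
        for o in range(len(p_obj)):
            dynamics = p_verb[v] * p_obj_given_verb[v][o]  # dynamics path
            static = p_obj[o] * p_verb_given_obj[o][v]     # static path
            row.append(0.5 * (dynamics + static))          # assumed consensus: mean
        scores.append(row)
    return scores


# Hypothetical vocabulary and scores for a single video clip.
verbs = ["open", "close"]
objects = ["door", "book"]
p_verb = [0.7, 0.3]
p_obj = [0.4, 0.6]
p_obj_given_verb = [[0.5, 0.5],   # given "open"
                    [0.2, 0.8]]   # given "close"
p_verb_given_obj = [[0.6, 0.4],   # given "door"
                    [0.8, 0.2]]   # given "book"

scores = compose_action_scores(p_verb, p_obj, p_obj_given_verb, p_verb_given_obj)
best_v, best_o = max(
    ((v, o) for v in range(len(verbs)) for o in range(len(objects))),
    key=lambda pair: scores[pair[0]][pair[1]],
)
best = (verbs[best_v], objects[best_o])
print(best, round(scores[best_v][best_o], 3))
```

Because every (verb, object) pair receives a score, a composition that never appeared in training can still be ranked, as long as its verb and object were each seen separately.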