The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks

Carmela Calabrese; Stefano Berti; Giulia Pasquale; Lorenzo Natale

TL;DR

This work tackles zero-shot, multi-label action recognition in object-centric robotics by proposing Dual-VCLIP, a lightweight, prompt-based extension of vision-language models that leverages positive/negative prompts and class-specific frame aggregation to align video features with textual class descriptions. Built on OpenVCLIP and the DualCoOp framework, it trains only two prompts while keeping the rest of the model frozen, enabling efficient adaptation to new tasks with limited data. Evaluations on Charades demonstrate competitive performance in both zero-shot and fully supervised settings, and analyses of verb–object splits reveal biases and guidance for compositional generalization in robotics. The findings highlight practical implications for rapid, data-efficient learning in human–robot collaboration, and point to future work on conditioning prompts, more splits, and debiasing strategies to improve robust zero-shot/few-shot transfer.

Abstract

Multi-label action recognition in videos is a significant challenge for robotic applications in dynamic environments, especially when the robot must cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that at training time it learns only two prompts, which makes it much simpler than other methods. We validate our method on the Charades dataset, which consists mostly of object-based actions, and show that, despite its simplicity, it performs favorably with respect to existing methods on the complete dataset and achieves promising performance on unseen actions. Our contribution emphasizes the impact of verb-object class splits when training robots for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.

Paper Structure (14 sections, 9 equations, 3 figures, 4 tables)

Introduction
Related work
Multi-label action recognition
Multi-label zero-shot action recognition
Compositionality in action recognition
Background
Problem definition
Method
Experiments
Results
Comparison with SOTA
Compositionality
Spatial vs Temporal generalization
Conclusions

Figures (3)

Figure 1: Overview of the proposed Dual-VCLIP for multi-modal, multi-label action recognition. It has two main components: a video encoder and a text encoder. DualCoOp learns a pair of positive and negative prompts to quickly adapt pretrained vision-text encoders to the multi-label recognition (MLR) task. For each class, the two prompts generate a pair of contrastive (positive and negative) textual embeddings as input to the text encoder. Furthermore, we propose Class-Specific Frame Feature Aggregation, which first projects each frame's feature to the textual space and then aggregates the temporal logits by the magnitude of the class-specific semantic responses. During training, we apply the asymmetric loss from DualCoOp (Sun et al., 2022) to optimize the learnable prompts while keeping the other network components frozen.
Figure 2: Per-frame inference with Dual-VCLIP on two examples from the Charades test set. Underlined classes are unseen, i.e., they did not appear at training time. Classes that appear in only one frame are discarded.
Figure 3: Confusion matrix. (A) Sub-matrix representing an object-cluster (e.g., 'bag'). (B) Sub-matrix representing a verb-cluster (e.g., 'closing'). For these examples, we binarized the predicted outputs with a threshold equal to 0.5 on the confidence value.
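
The aggregation step described in the Figure 1 caption, together with the 0.5 threshold mentioned for Figure 3, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax weighting over frames, and the `temporal_temp` and `logit_scale` values are assumptions chosen for illustration, loosely following the DualCoOp idea of weighting each position by the strength of its class-specific response. Frame features are assumed to be already projected into the text embedding space.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_frames(frame_feats, pos_text, neg_text,
                     temporal_temp=10.0, logit_scale=20.0):
    """Hypothetical class-specific frame aggregation.

    frame_feats: [T, D] per-frame features, assumed already in the text space.
    pos_text, neg_text: [C, D] embeddings of the positive / negative prompts.
    temporal_temp, logit_scale: illustrative values, not taken from the paper.
    Returns a [C] vector of per-class probabilities for the whole video.
    """
    f = l2_normalize(frame_feats)
    # Per-frame cosine similarity to both prompt types: shape [2, T, C].
    sims = np.stack([f @ l2_normalize(pos_text).T,
                     f @ l2_normalize(neg_text).T])
    # Frames with a stronger response for a class get a larger weight.
    weights = softmax(temporal_temp * sims, axis=1)
    # Weighted sum over frames gives video-level logits: shape [2, C].
    video_logits = logit_scale * (weights * sims).sum(axis=1)
    # Compare positive against negative evidence for each class.
    return softmax(video_logits, axis=0)[0]

def binarize(probs, threshold=0.5):
    """Turn confidences into multi-label decisions (0.5 as in Figure 3)."""
    return probs >= threshold

# Toy usage: 8 frames, 512-dim features, 5 classes, random values.
rng = np.random.default_rng(0)
probs = aggregate_frames(rng.normal(size=(8, 512)),
                         rng.normal(size=(5, 512)),
                         rng.normal(size=(5, 512)))
labels = binarize(probs)
```

Because the weights are computed per class, a class that is visible in only a few frames can still dominate its own video-level logit, which is the motivation for aggregating by response magnitude rather than by a plain temporal average.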
