The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks
Carmela Calabrese, Stefano Berti, Giulia Pasquale, Lorenzo Natale
TL;DR
This work tackles zero-shot, multi-label action recognition in object-centric robotics by proposing Dual-VCLIP, a lightweight, prompt-based extension of vision-language models that leverages positive/negative prompts and class-specific frame aggregation to align video features with textual class descriptions. Built on OpenVCLIP and the DualCoOp framework, it trains only two prompts while keeping the rest of the model frozen, enabling efficient adaptation to new tasks with limited data. Evaluations on Charades demonstrate competitive performance in both zero-shot and fully supervised settings, and analyses of verb–object splits reveal biases and offer guidance for compositional generalization in robotics. The findings highlight practical implications for rapid, data-efficient learning in human–robot collaboration, and point to future work on conditioning prompts, additional class splits, and debiasing strategies to improve robust zero-shot/few-shot transfer.
Abstract
Multi-label action recognition in videos is a significant challenge for robotic applications in dynamic environments, especially when the robot must cooperate with humans on tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that at training time it learns only two prompts, and it is therefore much simpler than other methods. We validate our method on the Charades dataset, in which the majority of actions are object-based, demonstrating that, despite its simplicity, our method performs favorably with respect to existing methods on the complete dataset and shows promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class splits when training robots for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
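
To make the idea of scoring a video against a positive and a negative prompt per class more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the function name `dual_prompt_scores`, the temperature value, the use of positive-prompt similarities as frame weights, and the final sigmoid over the positive/negative difference are all assumptions chosen for illustration, loosely following the general dual-prompt idea of DualCoOp with frames in place of image regions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def dual_prompt_scores(frame_feats, pos_text, neg_text, temperature=0.07):
    """Hypothetical multi-label scoring with one positive and one negative
    text embedding per class.

    frame_feats: (T, D) per-frame visual features from a frozen encoder
    pos_text:    (C, D) text embeddings built from the positive prompt
    neg_text:    (C, D) text embeddings built from the negative prompt
    returns:     (C,) independent per-class probabilities
    """
    f = l2_normalize(frame_feats)
    p = l2_normalize(pos_text)
    n = l2_normalize(neg_text)

    sim_pos = f @ p.T  # (T, C) cosine similarity, frame vs. positive prompt
    sim_neg = f @ n.T  # (T, C) cosine similarity, frame vs. negative prompt

    # Class-specific frame aggregation (assumed form): each class weights
    # the frames by a softmax over its own positive similarities.
    weights = softmax(sim_pos / temperature, axis=0)  # (T, C)
    agg_pos = (weights * sim_pos).sum(axis=0)  # (C,)
    agg_neg = (weights * sim_neg).sum(axis=0)  # (C,)

    # Positive evidence minus negative evidence, squashed per class so that
    # several actions can be active in the same video.
    logits = (agg_pos - agg_neg) / temperature
    return 1.0 / (1.0 + np.exp(-logits))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 512))   # 8 frames, 512-dim features (made up)
    pos = rng.normal(size=(5, 512))     # 5 hypothetical action classes
    neg = rng.normal(size=(5, 512))
    print(dual_prompt_scores(video, pos, neg).shape)
```

In this sketch only the two prompts that produce `pos_text` and `neg_text` would be trainable, while the encoders that produce the features stay frozen, which mirrors the "only two prompts" property described in the abstract. Because each class gets its own sigmoid rather than sharing a softmax across classes, the scores are suited to the multi-label case.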
