OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models

Lala Shakti Swarup Ray; Bo Zhou; Sungho Suh; Paul Lukowicz

TL;DR

This work tackles open-world recognition of human-to-human interactions by proposing OV-HHIR, an open-vocabulary framework that leverages large language models to generate descriptive text and align video and text embeddings. It introduces HHIRChat, a unified 103-category HHIR dataset with $86{,}623$ sequences created by merging existing benchmarks and converting labels into soft, descriptive prompts. The architecture uses three vision-language branches for two interacting individuals and the background, with segmentation provided by Track Anything and feature extraction via ViTPose and ViT, all integrated through LLaMA 2 13B Chat to produce open-ended interaction descriptions. Experimental results show OV-HHIR surpassing fixed-vocabulary and existing cross-modal video models in cosine similarity and Macro-F1 across diverse datasets, and demonstrate open-world capabilities with unseen interactions; however, high memory requirements and potential generation errors suggest avenues for future optimization and real-time deployment.

Abstract

Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.

Paper Structure (9 sections, 3 figures, 3 tables)

Introduction
Proposed Method
HHIRChat Dataset
OV-HHIR Architecture
Experimental Results
Quantitative Evaluation
Qualitative Evaluation
Limitations
Conclusion

Figures (3)

Figure 1: HHIRChat data format that mixes cross-modal tokens.
Figure 2: Overview of the proposed OV-HHIR framework, which uses three video-language branches and LLaMA 2 13B Chat to generate open-ended natural language descriptions of human-to-human interactions from dynamic video sequences. During training, Track Anything, ViTPose, ViT, and LLaMA 2 13B Chat have frozen weights, while the separate instances of Q-Former and the linear layers have learnable weights.
Figure 3: Examples comparing the ground truth with predictions from vanilla VideoLLaMA 2, VideoLLaMA 2 LoRA, and OV-HHIR.
