HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

TL;DR

This paper introduces HENASY (Hierarchical ENtities ASsemblY), which uses a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationships, leveraging compositional structure understanding to build the video representation.

Abstract

Current video-language models (VLMs) rely extensively on instance-level alignment between the video and language modalities, which presents two major limitations: (1) visual reasoning departs from the way humans naturally perceive scenes from a first-person perspective, which leaves the reasoning hard to interpret; and (2) learning is limited in its ability to capture the inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach to egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism that explicitly assembles dynamically evolving scene entities through time and models their relationships for video representation. By leveraging compositional structure understanding, HENASY offers strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding. The suite comprises three alignment types: video-narration, noun-entity, and verb-entities alignment. Our method demonstrates strong interpretability in both quantitative and qualitative experiments while maintaining competitive performance on five downstream tasks, either via zero-shot transfer or as a video/text representation: video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.
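
To make the idea of a contrastive alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss between paired video and narration embeddings. It is not the paper's implementation: the function name `symmetric_info_nce`, the temperature value of 0.07, and the use of plain Python lists are all assumptions chosen for illustration, and the noun-entity and verb-entities terms of the paper may be formulated differently.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's code).

    video_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other pairing in the batch is treated as a negative.
    """
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]
    n = len(videos)

    # Cosine similarity matrix scaled by the (assumed) temperature.
    sim = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(_log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)
```

Under these assumptions, a batch whose video and text embeddings line up pair by pair yields a loss near zero, while a batch with swapped pairs yields a large loss. A multi-grained variant would, in principle, apply the same kind of objective at more than one level (for example between nouns and entity embeddings), but the exact form used by HENASY is not specified in this summary.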

Paper Structure (20 sections, 14 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Problem Overview. (a) Current VLMs [lavila_cvpr2023] rely on instance-level contrastive learning between video and narration. HelpingHands [helpinghand_iccv2023] implicitly induces object occurrence information into video features at the final layer of the video encoder. (b) Our proposed HENASY assembles dynamic entities from video patches via a local entity encoder, while an entity-aware decoder captures interactions between entities and the global context to form a comprehensive video representation. HENASY is trained with a suite of multi-grained contrastive alignments to enforce visual representations from the entity level up to the video level. (c) With this compositional approach, HENASY is the first VLM that shows strong interpretability via visual grounding with both appearance and motion query types.
  • Figure 2: Overview of the HENASY framework for video-language modeling. Left: HENASY features a dual-encoder architecture with a compositional video understanding approach. The local entity encoder assembles dynamic scene entities from video patches, while the global encoder provides contextual features. These are combined in the entity-aware decoder to create an interpretable video representation. Right: HENASY is supported by a suite of multi-grained contrastive learning objectives to enforce both entity-level and video-level representations.
  • Figure 3: Illustration of entity-aware decoder.
  • Figure 4: Vision-Language Grounding. Qualitative comparisons with HelpingHands [helpinghand_iccv2023] on EgoCLIP [egovlp_neurips2022]. Left: comparison using a noun query obtained from the narration, with the pseudo-groundtruth boxes detected by [100doh_cvpr2020] shown for reference. Right: comparison using the verb phrase in the narration. Because verb phrases cannot be captured by [100doh_cvpr2020], no pseudo boxes are included.
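
The captions for Figures 1 and 2 describe a local entity encoder that assembles scene entities from video patches. One common way to realize such grouping is a soft assignment of patch tokens to a small set of entity query tokens; the sketch below illustrates that generic idea only. The function `group_tokens`, the dot-product similarity, and the softmax over entities are assumptions made for illustration, and the actual HENASY mechanism may differ.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def group_tokens(patch_tokens, entity_queries):
    """Softly assign patch tokens to entity queries (illustrative sketch).

    Returns (entities, assign), where assign[p][e] is the weight of patch p
    on entity e (each row sums to 1) and entities[e] is the weighted mean of
    the patch tokens under those weights.
    """
    # Each patch distributes one unit of weight across all entities.
    assign = [
        _softmax([_dot(patch, query) for query in entity_queries])
        for patch in patch_tokens
    ]

    dim = len(patch_tokens[0])
    entities = []
    for e in range(len(entity_queries)):
        weights = [row[e] for row in assign]
        total = sum(weights)  # strictly positive because softmax outputs are > 0
        entities.append([
            sum(w * patch[d] for w, patch in zip(weights, patch_tokens)) / total
            for d in range(dim)
        ])
    return entities, assign
```

In a sketch like this, the assignment matrix doubles as a per-entity map over patches, which is the kind of byproduct that makes visual grounding possible: a text query matched to one entity can be traced back to the patches that contributed most to it.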