ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su, Yi Wang, Qiongyang Hu, Chuang Yang, Lap-Pui Chau

TL;DR

This work introduces Ego-IRG, a unified task that, given an egocentric image and a text query, analyzes the interaction, answers the query, and grounds the answer at the pixel level. To support this task, the authors build Ego-IRGBench, a large-scale RGB-D dataset with over 1.6 million query-answer-mask pairs spanning 20k+ images, depth maps, and interaction descriptions. They propose ANNEXE, a depth-aware architecture that couples a text-generation module (driven by multimodal language models) with a depth-guided mask-generation module to produce fluent descriptions, targeted answers, and fine-grained segmentation masks in response to queries. Empirical results show that ANNEXE achieves state-of-the-art performance on both textual outputs and pixel grounding on Ego-IRGBench, with depth supervision providing notable gains in grounding accuracy and overall task performance, highlighting the approach’s potential for flexible downstream egocentric understanding.

Abstract

Egocentric interaction perception is an essential branch of research on human-environment interaction and lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot simultaneously yield coherent textual and pixel-level responses to user queries, which limits their flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image and a query as input, Ego-IRG is the first task that aims to resolve interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets do not meet the requirements of the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset with extensive manual effort; it includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model that generates text- and pixel-level outputs using multimodal large language models, enabling a comprehensive interpretation of egocentric interactions. Experiments on Ego-IRGBench demonstrate the effectiveness of our ANNEXE model compared with other works.

Paper Structure

This paper contains 18 sections, 6 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Ego-IRG task versus other related tasks. The Ego-IRG task advances the interaction understanding compared with other related tasks such as Referring Image Segmentation (RIS), Egocentric Hand-Object Interaction detection (EHOI), and Egocentric Question Answering (EgoVQA).
  • Figure 2: Overall architecture of the proposed ANNEXE model, a synergy of text generation and mask generation modules.
  • Figure 3: Illustration of the proposed Ego-IRGBench dataset, which includes (a)(i) the egocentric image, (a)(ii) the depth map, and the interaction description. It also includes (b) single-, (c) multi-, and (d) no-target samples with the corresponding query, description (Des.), and answer (Ans.).
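
The sample composition described in the Figure 3 caption can be pictured as a small record type. The following is a minimal sketch only: the class name `EgoIRGSample`, every field name, the file paths, the example strings, and the list-of-lists mask encoding are hypothetical and are not taken from the released dataset. It also assumes that each grounded target contributes one binary mask, so that single-, multi-, and no-target samples differ only in the number of masks.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical encoding: a binary mask stored as rows of 0/1 values.
Mask = List[List[int]]


@dataclass
class EgoIRGSample:
    """Illustrative record for one query about one egocentric image."""
    image_path: str    # egocentric RGB image
    depth_path: str    # corresponding depth map
    query: str         # user query about the interaction
    description: str   # interaction description (the "analyzing" output)
    answer: str        # textual answer to the query
    masks: List[Mask] = field(default_factory=list)  # one mask per target (assumed)

    def target_type(self) -> str:
        """Classify the sample by how many targets are grounded."""
        if not self.masks:
            return "no-target"
        return "single-target" if len(self.masks) == 1 else "multi-target"


# Usage with invented values, for illustration only.
sample = EgoIRGSample(
    image_path="images/000001.jpg",
    depth_path="depth/000001.png",
    query="What is the left hand holding?",
    description="The left hand is holding a cup.",
    answer="A cup.",
    masks=[[[0, 1], [1, 1]]],
)
print(sample.target_type())
```

Keeping the masks in a list lets one type cover all three sample kinds shown in the figure; an actual loader would more likely store masks as arrays or in a compressed encoding.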