Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration

Zhuyu Teng; Pei Chen; Yichen Cai; Ruoqing Lu; Zhaoqu Jiang; Jiayang Li; Weitao You; Lingyun Sun

Abstract

Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch, and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback allowing users to clarify and refine the AI's understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating that its components work in concert to improve collaboration.

Paper Structure (41 sections, 2 equations, 10 figures, 3 tables)

Introduction
Related Work
Wearable AI Assistants for Situated Guidance
Perception from First-Person Perspective
Cognitive Alignment in Human-AI Collaboration
Concept of Eye2Eye: A First-Person Perspective Framework for Cognitive Alignment
Identifying Core Challenges
Design Requirements
Core Components
Component I: Joint Attention Coordination (See + Focus)
Component II: Accumulated Common Ground (Understand + Memorize)
Component III: Reflective Situated Feedback (Act + Reflect)
AR Prototype System
Always-On Perception and Event-Driven Triggering (for Component I)
Object-Card Memory Construction and Updating (for Component II)
...and 26 more sections

Figures (10)

Figure 1: Framework overview of Eye2Eye. The framework illustrates the bidirectional cognitive loop between humans (left) and AI (right), mediated by three core components. The process flows from (1) joint attention, aligning and interpreting the focus of attention, to (2) common ground, enriching and updating shared memory, and then to (3) situated feedback, acting and reflecting based on context. This feedback loop, in turn, updates the shared memory and can proactively generate new attention cues.
Figure 2: Bidirectional attention in a shared first-person perspective: the AI captures and infers the human's attention from explicit and implicit cues, then aligns its own attention with the human's and provides guidance through multimodal feedback.
Figure 3: The technical pipeline of the implemented prototype: (1) attention trigger and interaction event understanding; (2) retrieving and revising common ground for cognitive alignment; and (3) executing the feedback strategy.
Figure 4: Details of the dynamic, self-correcting memory unit structure: each analysis of the current state ($c_{state}$) retrieves existing object cards to either create a new common ground unit or update an existing one, enabling persistent accumulation. Updates to the AI response are based on reflection of user feedback from interactions, achieving cognitive refinement.
Figure 5: Mapping of situation types to feedback strategy modalities and categories, with representative examples for each case.
...and 5 more figures
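
The create-or-update behavior described in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ObjectCard`, `CommonGroundMemory`, `observe`, and `revise` are hypothetical, and matching objects by a plain string identifier is an assumption made only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectCard:
    """Hypothetical common ground unit for one object in view."""
    object_id: str
    description: str  # the AI's current understanding of the object
    history: list = field(default_factory=list)  # earlier descriptions, oldest first


class CommonGroundMemory:
    """Toy store that creates or updates object cards (illustrative only)."""

    def __init__(self):
        self.cards = {}

    def observe(self, object_id, description):
        """Create a card for a new object, or update the existing one."""
        card = self.cards.get(object_id)
        if card is None:
            # First sighting: create a new common ground unit.
            card = ObjectCard(object_id, description)
            self.cards[object_id] = card
        elif card.description != description:
            # Known object with a changed description: keep the old one as history.
            card.history.append(card.description)
            card.description = description
        return card

    def revise(self, object_id, corrected_description):
        """Apply a user correction to an existing card."""
        if object_id not in self.cards:
            raise KeyError(object_id)
        return self.observe(object_id, corrected_description)


memory = CommonGroundMemory()
memory.observe("mug", "a red mug on the desk")
memory.revise("mug", "the user's tea mug, not to be moved")
```

Keeping the superseded description in `history` is one simple way to make a revision traceable; the prototype's actual card structure and retrieval method are not specified in this excerpt.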
