EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel

TL;DR

This work proposes EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure, and achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B by over 10 points.

Abstract

Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
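The second stage described above relies on GRPO, which scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative normalization, not the authors' implementation. The `combined_reward` function, its weights, and the choice of population standard deviation are assumptions made for illustration; the abstract only states that task-aware rewards verify entity grounding, temporal alignment, and logical consistency.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, grounding_score, logic_score,
                    w_ground=0.5, w_logic=0.5):
    """Hypothetical task-aware reward: answer accuracy plus weighted
    grounding and logic terms. The weights are illustrative assumptions."""
    return float(answer_correct) + w_ground * grounding_score + w_logic * logic_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than the group average receive a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for a single question.
rollouts = [
    (True, 1.0, 1.0),   # correct answer, fully grounded and consistent
    (True, 0.5, 0.0),   # correct answer, weakly grounded
    (False, 1.0, 0.5),  # wrong answer, well grounded
    (False, 0.0, 0.0),  # wrong answer, ungrounded
]
rewards = [combined_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, a correct but poorly grounded response can still receive a lower advantage than a correct and well-grounded one, which is how an auxiliary reward term can shape the reasoning trace rather than only the final answer.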

Paper Structure (14 sections, 3 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Illustration of 4D reasoning tasks. We distinguish between fixtures (static objects in the scene) and objects (moving entities) to evaluate complex spatial-temporal understanding (from perrett2025hd). For each task, we display six frames sampled from the original video. These tasks require the model to reason about fixture locations using "clock-face" orientations, track object movement itineraries, and count interactions despite constant ego-motion. Correct answers are shown in bold green text. Some options are omitted due to space limits.
  • Figure 2: Automated Metadata-Driven Pipeline for QA and CoT Generation. We first preprocess the video data by (a) extracting precise 2D & 3D object trajectories via video object detection and SLAM-based camera alignment; and (b) merging these with Gemini-refined text narrations to create unified 4D Descriptions. These descriptions are processed through Task-Adaptive Thinking Templates to generate grounded QA pairs and CoT traces.
  • Figure 3: Task-Adaptive Thinking Templates. Our templates decompose 4D reasoning into structured sub-steps. Yellow highlights indicate grounded entity names (fixtures and objects) and timestamps, while red highlights denote specific spatial-temporal metadata, such as angular orientations and trajectory segments.
  • Figure 4: Overview of the EgoReasoner training paradigm. Stage I, Structured Cold-Start (SFT): the MLLM is fine-tuned to imitate structured reasoning traces. It learns to generate template-based <think> blocks that anchor entities and timestamps and identify fixtures before providing the final <answer>. Stage II, Grounded Reinforcement Fine-Tuning (RFT): the model is optimized via GRPO to ensure physical verifiability.
  • Figure 5: Impact of task-aware rewards during RFT. We compare standard RFT (without any task-aware rewards) against variants using Grounding, Logic, and Combined rewards across six tasks.
  • ...and 1 more figure
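
The Figure 1 caption mentions "clock-face" orientations for fixture locations. As a rough illustration of what such an encoding involves, the sketch below maps a fixture's offset from the camera to a clock hour. The convention used here (12 o'clock is straight ahead, hours increase clockwise, offsets given as rightward `dx` and forward `dz`) is an assumption of this example and may differ from the benchmark's exact definition.

```python
import math


def clock_direction(dx, dz):
    """Map a horizontal offset in the viewer's frame to a clock hour (1-12).

    dx: offset to the viewer's right; dz: offset straight ahead.
    Both the axis names and the 12-o'clock-is-forward convention are
    assumptions of this sketch.
    """
    # Bearing in degrees, measured clockwise from straight ahead.
    angle = math.degrees(math.atan2(dx, dz)) % 360.0
    # Each clock hour spans 30 degrees; hour 0 is reported as 12.
    hour = round(angle / 30.0) % 12
    return 12 if hour == 0 else hour


ahead = clock_direction(0.0, 1.0)    # fixture directly in front
right = clock_direction(1.0, 0.0)    # fixture directly to the right
behind = clock_direction(0.0, -1.0)  # fixture directly behind
```

Because the answer depends on the camera's current heading, the same fixture maps to a different hour whenever the wearer turns, which is why the caption stresses reasoning despite constant ego-motion.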