EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang

TL;DR

EgoThinker addresses the gap in multimodal models' egocentric reasoning by constructing EgoRe-5M, a large-scale egocentric QA dataset with chain-of-thought and fine-grained grounding. It couples supervised fine-tuning on diverse data with reinforcement fine-tuning via GRPO using rule-based rewards to improve spatio-temporal localization and causal inference. Empirical results show state-of-the-art performance on multiple egocentric benchmarks and substantial gains in fine-grained grounding, while preserving general video understanding capabilities. This work lays a foundation for embodied, wearables-oriented AI by enabling robust first-person reasoning and precise hand–object localization in long-form egocentric videos.
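To make the "GRPO with rule-based rewards" step concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the rewards are IoU scores between a predicted and a ground-truth time span or bounding box, which this summary does not specify. It also assumes the commonly described GRPO advantage, where each sampled response's reward is normalized by the mean and standard deviation of its group. The function names and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, e.g. in seconds (assumed reward)."""
    ps, pe = pred
    gs, ge = gt
    if pe <= ps or ge <= gs:
        return 0.0  # malformed or empty interval earns no reward
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union


def box_iou(pred, gt):
    """IoU of two (x1, y1, x2, y2) boxes (assumed reward for grounding)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical sampled answers to one temporal grounding question.
    ground_truth = (12.0, 20.0)
    samples = [(12.0, 20.0), (10.0, 18.0), (30.0, 35.0), (14.0, 19.0)]
    rewards = [temporal_iou(s, ground_truth) for s in samples]
    print(group_relative_advantages(rewards))
```

Because the reward is computed by a fixed rule rather than a learned reward model, responses that localize better than their group average receive positive advantages, and those that localize worse receive negative ones.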

Abstract

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.

Paper Structure

This paper contains 54 sections, 5 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Overview of our EgoThinker. Unlike general video reasoning, egocentric video reasoning poses unique challenges because it must infer an unobservable camera wearer’s interactions and intentions. EgoThinker addresses this by curating EgoRe-5M, a large-scale egocentric reasoning dataset, and applying a two-stage supervised and reinforcement fine-tuning paradigm. This design empowers robust egocentric reasoning chat, hand–object grounding, and temporal grounding, making EgoThinker a promising foundation for wearable assistants and embodied AI.
  • Figure 2: Data Filtering Pipeline and EgoRe-5M Statistics. With our multi-stage filtering pipeline, we construct EgoRe-5M, a large-scale QA dataset to facilitate egocentric reasoning in MLLMs.
  • Figure 3: Ablations on number of frames.
  • Figure 4: Hand-object grounding visualization on the EK-Visor dataset. We compare our method to the baseline Qwen2-VL, GPT-4o, and the expert model Grounding-DINO. We use prompts tailored to each model, and for each image we use "chopping board", "knife", and "right hand" as grounding queries.
  • Figure 5: Temporal grounding visualization on the EgoExoLearn dataset. We compare our method to the baseline Qwen2-VL and to Gemini2.5-Pro, one of the strongest MLLMs.